Executive Summary: Customer Segmentation Project¶
Most Important Findings: This project aims to leverage unsupervised learning (dimensionality reduction and clustering) to segment customers based on their characteristics and engagement with marketing campaigns. The analysis will identify key customer segments, potentially revealing distinct behavioral patterns and preferences. The impact of segmentation on marketing ROI will be a crucial outcome. Further detail on specific findings is pending completion of the analysis outlined in the notebook.
Proposed Model Specifications: Model specifications will be determined based on analysis results. Unsupervised learning, potentially using PCA for dimensionality reduction and K-means or hierarchical clustering, will be employed. Algorithm selection and hyperparameter tuning will be based on performance metrics (e.g., silhouette score, Davies-Bouldin index). Feature selection will be crucial.
Key Next Steps:
- Data Preparation: Further EDA, outlier detection, missing value imputation, and feature engineering based around desired metrics.
- Dimensionality Reduction: Apply PCA and find the optimal number of components.
- Clustering: Use clustering algorithms to form customer segments; optimize the number of clusters and evaluate segmentation quality. Investigate the PCA stripes mentioned below in further detail.
- Segment Profiling: Analyze segment characteristics (demographics, purchase history, campaign engagement) and create segment profiles around marketing goals.
- Model Evaluation and Refinement: Evaluate model performance and refine algorithms and hyperparameters based on key marketing metrics.
- Actionable Insights: Develop recommendations for targeted marketing, personalized offers, and improved customer engagement.
- Documentation: Document the entire process, including rationale, results, and conclusions to present to clients.
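The dimensionality-reduction step above can be sketched as choosing the component count from the cumulative explained-variance ratio. This is a minimal illustration on synthetic stand-in data (not the marketing dataset), and the 90% variance threshold is an assumed target, not a project requirement:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled customer feature matrix:
# the last 5 columns are noisy copies of the first 5, so the
# effective dimensionality is lower than the raw column count.
X = rng.normal(size=(200, 10))
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(200, 5))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining >= 90% of the variance
n_components = int(np.argmax(cum_var >= 0.90)) + 1
print(n_components, round(cum_var[n_components - 1], 3))
```

On the real data the same two lines (`cum_var` and the `argmax` threshold) would run after scaling the selected features.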
Problem Summary: The business needs a better understanding of its customer base to improve marketing ROI. Currently, there's a lack of customer segmentation, leading to inefficient and potentially ineffective marketing campaigns. Targeted marketing efforts are hampered by an inability to identify distinct customer groups with varying needs and preferences.
Proposed Solution Design Rationale: Unsupervised machine learning (specifically dimensionality reduction and clustering) will be used to segment customers based on their observed characteristics and behaviors. This approach allows the identification of inherent groupings within the customer data without pre-defined labels, thus uncovering hidden patterns. Dimensionality reduction techniques like PCA will simplify the data, while clustering algorithms (like K-means or hierarchical clustering) will group customers with similar characteristics into segments.
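The proposed scale → reduce → cluster design can be sketched as a scikit-learn pipeline. The data below is a synthetic stand-in, and the choices of 2 components and 3 clusters are illustrative placeholders, not tuned values:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in: three loose customer groups in 8 features
X = np.vstack([rng.normal(loc=c, size=(100, 8)) for c in (0.0, 3.0, 6.0)])

# Scale -> reduce -> cluster, mirroring the proposed design
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
print(np.bincount(labels))  # segment sizes
```

Wrapping the steps in one pipeline keeps scaling and reduction consistent between fitting and later scoring of new customers.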
Business Impact: Effective customer segmentation will lead to improved marketing ROI by enabling targeted campaigns. By tailoring marketing messages and offers to specific customer segments, the business can expect higher conversion rates and increased customer engagement. The identification of distinct customer profiles will allow for more personalized experiences, leading to greater customer satisfaction and loyalty. Ultimately, the project aims to drive revenue growth by optimizing marketing spend and improving overall customer relationship management.
Recommendations for Implementation: Customer Segmentation Project¶
Key Recommendations:
- Prioritize Data Quality: Focus on thorough data cleaning and preparation. This includes robust outlier detection and handling, comprehensive missing value imputation strategies (justify the methods used), and thoughtful feature engineering aligned with business objectives and marketing goals. The success of the model heavily depends on the quality of the input data.
- Feature Selection Rigor: Carefully select relevant features for segmentation. Avoid using irrelevant or redundant features, which can negatively impact the performance of the dimensionality reduction and clustering algorithms. Employ feature importance analysis techniques to guide selection. Document rationale for inclusion/exclusion of each feature.
- Clustering Algorithm Validation: Experiment with various clustering algorithms (K-means, hierarchical, DBSCAN, etc.) and rigorously compare their performance. Evaluate using silhouette scores and visual representations. Consider the interpretability of each algorithm's output.
- Segment Profiling Depth: Go beyond basic descriptive statistics when profiling segments. Identify actionable insights within each segment. Visualize the distributions of key variables within each segment to highlight differentiators. Quantify the impact of segment differences on marketing ROI.
- Iterative Refinement: Embrace an iterative approach. Continuously evaluate the model's performance using relevant business metrics (e.g., conversion rates, customer lifetime value) and refine the model (parameters and features) based on real-world feedback and evolving business needs.
- Robust Documentation: Document every step of the process, including data preprocessing techniques, algorithm choices, hyperparameter settings, evaluation metrics, segment profiles, and actionable insights. This will be critical for communicating the results and their implications to stakeholders.
- Data Processing: Encode and scale the data so that it is interpretable by the various dimensionality reduction and clustering models.
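The algorithm-validation recommendation above can be sketched as a side-by-side silhouette comparison. The data is synthetic, and the hyperparameters (4 clusters, DBSCAN's `eps=0.3`) are illustrative assumptions rather than tuned values:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed customer matrix
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)

scores = {}
for name, model in [
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),
    ("hierarchical", AgglomerativeClustering(n_clusters=4)),
    ("dbscan", DBSCAN(eps=0.3, min_samples=5)),
]:
    labels = model.fit_predict(X)
    # silhouette_score needs at least 2 distinct labels
    # (DBSCAN can collapse everything into noise)
    if len(set(labels)) > 1:
        scores[name] = silhouette_score(X, labels)

for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```

Silhouette alone should not decide the winner; interpretability of the resulting segments matters as much as the score.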
Key Actionables for Stakeholders:
- Marketing Team: Provide feedback on segment profiles and validate insights against their domain expertise. Participate in the development of targeted marketing strategies.
- Sales Team: Utilize the segment profiles to tailor sales approaches and personalize customer interactions.
- Executive Leadership: Review the business impact analysis and approve budget for implementation of the recommendations.
Expected Benefits and Costs:
- Benefits:
- Increased Marketing ROI: Improved targeting leading to higher conversion rates (assume a 10-15% increase in conversion rates for targeted segments).
- Improved Customer Engagement: Personalized offers leading to increased customer satisfaction and loyalty (assume a 5-10% increase in customer retention).
- Enhanced Customer Lifetime Value: Stronger customer relationships lead to increased revenue over the customer's lifetime. (Quantify this using past data if available).
- Optimized Marketing Spend: More efficient resource allocation due to better targeting.
- Costs:
- Data Acquisition/Preparation: Time investment for data cleaning, preprocessing, and feature engineering (estimate hours and associated labor costs).
- Model Development & Maintenance: Time investment for model building, evaluation, and deployment; ongoing maintenance costs. (Estimate hours and associated labor costs).
- IT Infrastructure: Potential investment in cloud resources or software licenses (estimate recurring cost).
- Bad Data: The data may not always be accurate and may mislead decision makers, leading to a loss of sales and revenue.
Rational Assumptions (example):
- A 10% increase in conversion rates for targeted campaigns translates to an additional $X in revenue (estimate X based on current conversion rates and average order value).
- A 5% increase in customer retention reduces customer acquisition cost by $Y. (Estimate Y based on historical data).
- Each data scientist spends 'Z' hours per week on the project over 'N' weeks for a labor cost of $A. (Estimate hours, cost per hour, and timeframe).
Potential Risks and Challenges:
- Data quality issues: Inaccurate or incomplete data could lead to unreliable segments.
- Model complexity: Overly complex models might be difficult to interpret or deploy.
- Resistance to change: Stakeholders might be hesitant to adopt new strategies.
- Interpretability challenges: Difficulty in explaining the logic behind the segments to stakeholders.
Further Analysis:
- Competitor analysis: Analyze competitor segmentation strategies.
- Longitudinal analysis: Track segment performance over time to assess the stability of the segments and adapt strategies.
- A/B testing: Rigorously test marketing campaigns targeted at different segments.
- Causality vs. Correlation: While the model identifies patterns, further analysis needs to verify the causal relationship between segmentation and business outcomes. Consider potential confounding factors.
Note to readers: This detailed implementation plan should help guide the project to successful completion and provide measurable business value. Remember to adjust assumptions and quantifiable estimations based on specific business circumstances.
Capstone Project¶
Customer Segmentation¶
Context¶
Customer segmentation is the process of dividing a dataset of customers into groups of similar customers based on certain common characteristics, usually for the purpose of understanding the population dataset in a better fashion. Understanding customer behavior and characteristics is usually a critical part of the marketing operations of any business or organization, with direct consequences on sales & marketing strategy. Customer segmentation is often viewed as a means to achieve a better return on investment from marketing efforts, and make organizations more efficient in terms of utilizing their money, time, and other critical resources in custom marketing strategies for different groups of customers based on their unique needs and motivations.
For example, it has been understood from various research that customer segmentation often has a huge impact on people's email engagement. Segmented campaigns often see over 100% more clicks than non-segmented campaigns, and email marketers who have segmented their audience before campaigning have reported a 6-7 times growth in their overall revenue. It has also been observed in various contexts that in today's world, individual customers prefer personalized communications and offerings that cater to their particular interests.
In the context of marketing analytics then, customer segmentation has a vital role to play in optimizing ROI. It typically involves analyzing metrics around customer engagement with various marketing activities including but not limited to, ATL (above the line) marketing activities, BTL (below the line) campaigns, and targeting personalized offers. Typically, the variables of interest are customer profiles, campaign conversion rates, and information associated with various marketing channels. Based on these feature categories, the target is to create the best possible customer segments from the given data.
Objective¶
Using Unsupervised Learning ideas such as Dimensionality Reduction and Clustering, the objective is to come up with the best possible customer segments using the given customer dataset.
Marketing Campaign Analysis¶
Problem Definition¶
The Context:¶
- Why is this problem important to solve?
Effective customer segmentation is crucial for optimizing marketing ROI and improving customer engagement. Businesses invest significant resources in marketing campaigns, and without understanding customer behavior and preferences, these efforts may be inefficient or ineffective. Personalized marketing and targeted offers based on customer segments can lead to higher conversion rates, increased revenue, and stronger customer relationships. Furthermore, understanding different customer segments can inform product development, pricing strategies, and overall business strategy. Failing to segment customers risks wasting marketing budget and missing opportunities to tailor offerings to specific customer needs, leading to a competitive disadvantage.
The objective:¶
- What is the intended goal?
The primary goal is to develop a robust customer segmentation model using unsupervised learning techniques. This model should effectively group customers based on their characteristics and interactions with marketing campaigns, ultimately enabling more effective and efficient marketing strategies. The model's success will be measured by its ability to identify distinct customer segments with clear behavioral patterns, leading to actionable insights for personalized offers and targeted campaigns that improve key marketing metrics like ROI and customer engagement.
The key questions:¶
- What are the key questions that need to be answered?
- How many distinct customer segments exist within the dataset?
- What are the key characteristics and behaviors that define each segment?
- How do these segments differ in their responses to marketing campaigns (e.g., conversion rates, click-through rates, engagement levels)?
- What are the optimal marketing strategies (messaging, channel selection, offer types) for each identified segment?
- What is the potential impact of segment-specific marketing strategies on key business metrics (e.g., revenue, customer lifetime value, customer retention)?
- What are the most relevant and least relevant metrics to measure customer engagement?
- How can we best visualize and interpret the segments, making them useful for business users?
The problem formulation:¶
- What is it that we are trying to solve using data science?
We are using unsupervised learning, specifically dimensionality reduction (e.g., PCA) and clustering (e.g., K-means, hierarchical clustering), to identify optimal customer segments from a given dataset. The dataset likely contains customer demographics, purchase history, and interactions with various marketing channels and campaigns. The model will aim to identify underlying patterns and relationships among these features to group customers into homogeneous segments. We'll evaluate the model using metrics like silhouette scores and Davies-Bouldin indexes, with a primary focus on generating actionable insights that enhance marketing ROI and customer engagement. The challenge lies in finding the right balance between the number of segments, the homogeneity of each segment, and the interpretability of segment profiles in the context of marketing strategy.
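As a sketch of the evaluation approach described here, candidate cluster counts can be scored with both metrics on synthetic stand-in data. Picking k by the best silhouette is one reasonable rule (a lower Davies-Bouldin index should corroborate it), not the only one:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic stand-in with a known group structure
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.7, random_state=1)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    # Higher silhouette is better; lower Davies-Bouldin is better
    results[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

best_k = max(results, key=lambda k: results[k][0])
print(best_k, results[best_k])
```

On the real customer data, the metric curves are usually less clear-cut, which is why segment interpretability remains part of the decision.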
Data Dictionary¶
The dataset contains the following features:
- ID: Unique ID of each customer
- Year_Birth: Customer’s year of birth
- Education: Customer's level of education
- Marital_Status: Customer's marital status
- Kidhome: Number of small children in customer's household
- Teenhome: Number of teenagers in customer's household
- Income: Customer's yearly household income in USD
- Recency: Number of days since the last purchase
- Dt_Customer: Date of customer's enrollment with the company
- MntFishProducts: The amount spent on fish products in the last 2 years
- MntMeatProducts: The amount spent on meat products in the last 2 years
- MntFruits: The amount spent on fruits products in the last 2 years
- MntSweetProducts: Amount spent on sweet products in the last 2 years
- MntWines: The amount spent on wine products in the last 2 years
- MntGoldProds: The amount spent on gold products in the last 2 years
- NumDealsPurchases: Number of purchases made with discount
- NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
- NumStorePurchases: Number of purchases made directly in stores
- NumWebPurchases: Number of purchases made through the company's website
- NumWebVisitsMonth: Number of visits to the company's website in the last month
- AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- Complain: 1 If the customer complained in the last 2 years, 0 otherwise
Note: You can assume that the data is collected in the year 2025.
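Given the note above, a derived Age feature would be computed against the assumed 2025 collection year. A minimal sketch on a tiny stand-in frame (`demo` is hypothetical; the real notebook would apply this to its loaded DataFrame):

```python
import pandas as pd

# Tiny stand-in for rows of marketing_campaign.csv
demo = pd.DataFrame({"Year_Birth": [1957, 1984, 1996]})

# Per the note above, assume the data was collected in 2025
COLLECTION_YEAR = 2025
demo["Age"] = COLLECTION_YEAR - demo["Year_Birth"]
print(demo["Age"].tolist())  # → [68, 41, 29]
```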
Import the necessary libraries and load the data¶
# Uninstall the specified packages
# !pip uninstall -y numpy pandas scikit-learn sklearn_extra
# Reinstall the specified packages, ensuring the correct order
!pip install numpy==1.24.3
Requirement already satisfied: numpy==1.24.3 in /usr/local/lib/python3.11/dist-packages (1.24.3)
import numpy as np # Import NumPy first
print(f"NumPy version: {np.__version__}")
NumPy version: 1.24.3
!pip install pandas==2.0.3
Requirement already satisfied: pandas==2.0.3 in /usr/local/lib/python3.11/dist-packages (2.0.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (1.24.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
!pip install scikit-learn
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.11/dist-packages (1.6.1)
Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.24.3)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.14.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (3.6.0)
# Install a compatible version of sklearn_extra
!pip install scikit-learn-extra==0.3.0
Requirement already satisfied: scikit-learn-extra==0.3.0 in /usr/local/lib/python3.11/dist-packages (0.3.0)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra==0.3.0) (1.24.3)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra==0.3.0) (1.14.1)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra==0.3.0) (1.6.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra==0.3.0) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra==0.3.0) (3.6.0)
# import sklearn_extra # Import sklearn_extra for its version
# import sklearn # Import sklearn for its version
import pandas as pd
import sklearn
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import cophenet
# Check installed versions
print(f"Scikit-learn version: {sklearn.__version__}") # Now sklearn is defined
# print(f"sklearn_extra version: {sklearn_extra.__version__}") # Now sklearn_extra is defined
Scikit-learn version: 1.6.1
import numpy as np
# np.import_array() # Force NumPy to import its array functions explicitly
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids
from scipy.cluster.hierarchy import fcluster # Import fcluster
from sklearn.mixture import GaussianMixture
import warnings
warnings.filterwarnings('ignore')
#loading the data
drive.mount('/content/drive')
# reading in the data and getting the overview
df_temp = pd.read_csv('/content/drive/MyDrive/marketing_campaign.csv')
df = df_temp.copy() # so as not to change the original data
Mounted at /content/drive
Data Overview¶
- Reading the dataset
- Understanding the shape of the dataset
- Checking the data types
- Checking for missing values
- Checking for duplicated values
- Dropping the ID column, as it adds no modeling value
df.head()
|   | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
df.shape
(2240, 27)
# Checking the data types
print(df.dtypes)
df.info()
ID                       int64
Year_Birth               int64
Education               object
Marital_Status          object
Income                 float64
Kidhome                  int64
Teenhome                 int64
Dt_Customer             object
Recency                  int64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
MntSweetProducts         int64
MntGoldProds             int64
NumDealsPurchases        int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
AcceptedCmp3             int64
AcceptedCmp4             int64
AcceptedCmp5             int64
AcceptedCmp1             int64
AcceptedCmp2             int64
Complain                 int64
Response                 int64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Response             2240 non-null   int64
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
df.isnull().sum()
|   | 0 |
|---|---|
| ID | 0 |
| Year_Birth | 0 |
| Education | 0 |
| Marital_Status | 0 |
| Income | 24 |
| Kidhome | 0 |
| Teenhome | 0 |
| Dt_Customer | 0 |
| Recency | 0 |
| MntWines | 0 |
| MntFruits | 0 |
| MntMeatProducts | 0 |
| MntFishProducts | 0 |
| MntSweetProducts | 0 |
| MntGoldProds | 0 |
| NumDealsPurchases | 0 |
| NumWebPurchases | 0 |
| NumCatalogPurchases | 0 |
| NumStorePurchases | 0 |
| NumWebVisitsMonth | 0 |
| AcceptedCmp3 | 0 |
| AcceptedCmp4 | 0 |
| AcceptedCmp5 | 0 |
| AcceptedCmp1 | 0 |
| AcceptedCmp2 | 0 |
| Complain | 0 |
| Response | 0 |
df.duplicated().sum()
0
# dropping ID column
df = df.drop('ID', axis=1)
print("Column 'ID' dropped successfully.")
Column 'ID' dropped successfully.
Observations and Insights from the Data overview:¶
- The dataset has 2240 rows and 27 columns (26 after dropping the ID column).
- The dataset consists of columns of strings, integers, and one float (Income).
- Income is the only column with missing values (NaN).
- No duplicated rows.
- Dropped the ID column because it is not relevant to any of the models.
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What are the summary statistics of the data? Explore summary statistics for both the numerical and the categorical variables
- How many unique observations are there in each category of the categorical columns? Write down your findings/observations/insights
- Are all categories different from each other, or can we combine some categories? Is 2n Cycle different from Master?
- There are 8 categories in Marital_Status, with some categories having a very low count (fewer than 5). Can we combine these categories with other categories?
# Summary statistics for numerical variables
print(df.describe().T)
print("\n\n")
# Summary statistics for categorical variables
print(df.describe(include=['object']).T)
count mean std min 25% \
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00
Kidhome 2240.0 0.444196 0.538398 0.0 0.00
Teenhome 2240.0 0.506250 0.544538 0.0 0.00
Recency 2240.0 49.109375 28.962453 0.0 24.00
MntWines 2240.0 303.935714 336.597393 0.0 23.75
MntFruits 2240.0 26.302232 39.773434 0.0 1.00
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00
Complain 2240.0 0.009375 0.096391 0.0 0.00
Response 2240.0 0.149107 0.356274 0.0 0.00
50% 75% max
Year_Birth 1970.0 1977.00 1996.0
Income 51381.5 68522.00 666666.0
Kidhome 0.0 1.00 2.0
Teenhome 0.0 1.00 2.0
Recency 49.0 74.00 99.0
MntWines 173.5 504.25 1493.0
MntFruits 8.0 33.00 199.0
MntMeatProducts 67.0 232.00 1725.0
MntFishProducts 12.0 50.00 259.0
MntSweetProducts 8.0 33.00 263.0
MntGoldProds 24.0 56.00 362.0
NumDealsPurchases 2.0 3.00 15.0
NumWebPurchases 4.0 6.00 27.0
NumCatalogPurchases 2.0 4.00 28.0
NumStorePurchases 5.0 8.00 13.0
NumWebVisitsMonth 6.0 7.00 20.0
AcceptedCmp3 0.0 0.00 1.0
AcceptedCmp4 0.0 0.00 1.0
AcceptedCmp5 0.0 0.00 1.0
AcceptedCmp1 0.0 0.00 1.0
AcceptedCmp2 0.0 0.00 1.0
Complain 0.0 0.00 1.0
Response 0.0 0.00 1.0
count unique top freq
Education 2240 5 Graduation 1127
Marital_Status 2240 8 Married 864
Dt_Customer 2240 663 31-08-2012 12
Observations:¶
- 24 rows have a null Income. Solution: remove those rows, since they make up only about 1% of the data.
df = df.dropna()
print("Rows with income at null dropped successfully.")
print(df.describe().T)
Rows with income at null dropped successfully.
count mean std min 25% \
Year_Birth 2216.0 1968.820397 11.985554 1893.0 1959.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.0
Kidhome 2216.0 0.441787 0.536896 0.0 0.0
Teenhome 2216.0 0.505415 0.544181 0.0 0.0
Recency 2216.0 49.012635 28.948352 0.0 24.0
MntWines 2216.0 305.091606 337.327920 0.0 24.0
MntFruits 2216.0 26.356047 39.793917 0.0 2.0
MntMeatProducts 2216.0 166.995939 224.283273 0.0 16.0
MntFishProducts 2216.0 37.637635 54.752082 0.0 3.0
MntSweetProducts 2216.0 27.028881 41.072046 0.0 1.0
MntGoldProds 2216.0 43.965253 51.815414 0.0 9.0
NumDealsPurchases 2216.0 2.323556 1.923716 0.0 1.0
NumWebPurchases 2216.0 4.085289 2.740951 0.0 2.0
NumCatalogPurchases 2216.0 2.671029 2.926734 0.0 0.0
NumStorePurchases 2216.0 5.800993 3.250785 0.0 3.0
NumWebVisitsMonth 2216.0 5.319043 2.425359 0.0 3.0
AcceptedCmp3 2216.0 0.073556 0.261106 0.0 0.0
AcceptedCmp4 2216.0 0.074007 0.261842 0.0 0.0
AcceptedCmp5 2216.0 0.073105 0.260367 0.0 0.0
AcceptedCmp1 2216.0 0.064079 0.244950 0.0 0.0
AcceptedCmp2 2216.0 0.013087 0.113672 0.0 0.0
Complain 2216.0 0.009477 0.096907 0.0 0.0
Response 2216.0 0.150271 0.357417 0.0 0.0
50% 75% max
Year_Birth 1970.0 1977.00 1996.0
Income 51381.5 68522.00 666666.0
Kidhome 0.0 1.00 2.0
Teenhome 0.0 1.00 2.0
Recency 49.0 74.00 99.0
MntWines 174.5 505.00 1493.0
MntFruits 8.0 33.00 199.0
MntMeatProducts 68.0 232.25 1725.0
MntFishProducts 12.0 50.00 259.0
MntSweetProducts 8.0 33.00 262.0
MntGoldProds 24.5 56.00 321.0
NumDealsPurchases 2.0 3.00 15.0
NumWebPurchases 4.0 6.00 27.0
NumCatalogPurchases 2.0 4.00 28.0
NumStorePurchases 5.0 8.00 13.0
NumWebVisitsMonth 6.0 7.00 20.0
AcceptedCmp3 0.0 0.00 1.0
AcceptedCmp4 0.0 0.00 1.0
AcceptedCmp5 0.0 0.00 1.0
AcceptedCmp1 0.0 0.00 1.0
AcceptedCmp2 0.0 0.00 1.0
Complain 0.0 0.00 1.0
Response 0.0 0.00 1.0
# Number of unique observations in each category of categorical columns
for col in df.select_dtypes(include=['object']):
print(f"\nColumn: {col}")
print(df[col].value_counts())
Column: Education
Education
Graduation 1116
PhD 481
Master 365
2n Cycle 200
Basic 54
Name: count, dtype: int64
Column: Marital_Status
Marital_Status
Married 857
Together 573
Single 471
Divorced 232
Widow 76
Alone 3
Absurd 2
YOLO 2
Name: count, dtype: int64
Column: Dt_Customer
Dt_Customer
31-08-2012 12
12-09-2012 11
14-02-2013 11
12-05-2014 11
20-08-2013 10
..
05-08-2012 1
18-11-2012 1
09-05-2014 1
26-06-2013 1
09-01-2014 1
Name: count, Length: 662, dtype: int64
Observations:¶
- Absurd and YOLO are unserious answers. Solution: drop those rows, since they are unlikely to hold valid information.
- Alone could arguably be merged into Single (or into Widow/Divorced). Solution: also drop those rows, since there are only 3 of them.
# Dropping all rows where marital status is "Alone", "Absurd", or "YOLO"
marital_status_to_drop = ["Alone", "Absurd", "YOLO"]
df = df[~df["Marital_Status"].isin(marital_status_to_drop)]
print(f"Rows with marital status {marital_status_to_drop} dropped successfully.")
# Analysis of 'Marital_Status' column
print("\nAnalysis of 'Marital_Status' column:")
print(df['Marital_Status'].value_counts())
Rows with marital status ['Alone', 'Absurd', 'YOLO'] dropped successfully.

Analysis of 'Marital_Status' column:
Marital_Status
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Name: count, dtype: int64
# Replacing all occurrences of "2n Cycle" with "Master" in the "Education" column
df['Education'] = df['Education'].replace('2n Cycle', 'Master')
# Analysis of 'Education' column
print("\nAnalysis of 'Education' column:")
print(df['Education'].value_counts())
Analysis of 'Education' column:
Education
Graduation    1114
Master         563
PhD            478
Basic           54
Name: count, dtype: int64
df.shape
(2209, 26)
Observations:¶
- Basic might mean high school and Graduation might mean a bachelor's degree, or vice versa. It is hard to say what those categories represent, so they are left as is.
- From 2240 rows down to 2209 after removing missing or bad data.
Univariate Analysis on Numerical and Categorical data¶
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
- Plot histogram and box plot for different numerical features and understand how the data looks like.
- Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Leading Questions:
- How does the distribution of Income variable vary across the dataset?
- The histogram and the box plot show some extreme values on the right side of the 'Income' distribution. Can we consider them outliers and remove them, or should we analyze these extreme values?
- There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not treat) them? At what percentile does the upper whisker lie?
# Plot histogram and box plot for numerical features
numerical_features = ['Year_Birth', 'Income', 'Recency', 'MntFishProducts', 'MntMeatProducts', 'MntFruits',
'MntSweetProducts', 'MntWines', 'MntGoldProds', 'NumDealsPurchases', 'NumCatalogPurchases',
'NumStorePurchases', 'NumWebPurchases', 'NumWebVisitsMonth']
for feature in numerical_features:
plt.figure(figsize=(12, 4))
#histogram and boxplot side by side
plt.subplot(1, 2, 1)
sns.histplot(df[feature], kde=True)
plt.title(f'Histogram of {feature}')
plt.subplot(1, 2, 2)
sns.boxplot(y=df[feature])
plt.title(f'Boxplot of {feature}')
plt.tight_layout()
plt.show()
Observations:¶
- Number of Web Visits and Number of Web Purchases per month show similar box plots, which is no surprise.
- Recency and Number of Store Purchases are the only features without outliers.
- Recency appears roughly uniformly distributed.
- Most features are right skewed, which is expected since there are no negative values and observations crowd just above 0.
# Exploring the categorical variables like Education, Kidhome, Teenhome, Complain, and any others.
categorical_features = ['Education', 'Kidhome', 'Teenhome', 'Complain', 'Marital_Status']
for feature in categorical_features:
plt.figure(figsize=(8, 6))
sns.countplot(x=feature, data=df)
plt.title(f'Distribution of {feature}')
plt.xlabel(feature)
plt.ylabel('Count')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.show()
print(f"\nValue counts for {feature}:\n{df[feature].value_counts()}\n")
Value counts for Education:
Education
Graduation    1114
Master         563
PhD            478
Basic           54
Name: count, dtype: int64

Value counts for Kidhome:
Kidhome
0    1279
1     884
2      46
Name: count, dtype: int64

Value counts for Teenhome:
Teenhome
0    1144
1    1014
2      51
Name: count, dtype: int64

Value counts for Complain:
Complain
0    2188
1      21
Name: count, dtype: int64

Value counts for Marital_Status:
Marital_Status
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Name: count, dtype: int64
Observations:¶
- Education is mostly "Graduation"
- Very few complaints, which could imply happy customers.
- Mostly 0 or 1 kid/teen at home.
- Mostly Together or Married
# Calculate the upper whisker for the 'Income' variable
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR
# Find the largest income value at or below the upper whisker
income_lvl = df['Income'][df['Income'] <= upper_whisker].max()
print(f"The upper whisker for income lies at an income of {income_lvl:.2f}.")
The upper whisker for income lies at an income of 113734.00.
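The computation above gives the whisker's income value; the leading question also asked at which percentile the whisker lies. That percentile is simply the share of observations at or below the whisker. A minimal sketch on synthetic right-skewed data (the numbers are illustrative, not drawn from this dataset):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" sample (illustrative only)
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.5, size=2000))

q1, q3 = income.quantile(0.25), income.quantile(0.75)
upper_whisker = q3 + 1.5 * (q3 - q1)

# Percentile of the whisker = fraction of observations at or below it
whisker_percentile = (income <= upper_whisker).mean() * 100
print(f"Upper whisker sits at roughly the {whisker_percentile:.1f}th percentile")
```

For a right-skewed distribution like income, the whisker typically sits well above the 90th percentile, which is why only a handful of rows land beyond it.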
# 1. How does the distribution of Income variable vary across the dataset?
# a: it is spread fairly evenly, except for a few extreme values around 150,000.
# 2. The histogram and the box plot show some extreme values on the right side of the 'Income' distribution. Can we consider them outliers and remove them, or should we analyze these extreme values?
# a: we analyze them below; one clearly bad value is removed.
# 3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not treat) them?
# a: there are so few of them that dropping them loses little information, so we drop them as well.
# Analyze the distribution of 'Income'
plt.figure(figsize=(12, 4))
# Histogram
plt.subplot(1, 2, 1)
sns.histplot(df['Income'], kde=True)
plt.title('Distribution of Income')
# Box plot
plt.subplot(1, 2, 2)
sns.boxplot(y=df['Income'])
plt.title('Boxplot of Income')
plt.tight_layout()
plt.show()
# Further exploration and analysis of extreme income values
# examine rows with income above a certain threshold
extreme_income_threshold = income_lvl
extreme_income_rows = df[df['Income'] > extreme_income_threshold]
print("\nRows with extremely high incomes:")
extreme_income_rows
Rows with extremely high incomes:
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 1973 | PhD | Married | 157243.0 | 0 | 1 | 01-03-2014 | 98 | 20 | 2 | ... | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 617 | 1976 | PhD | Together | 162397.0 | 1 | 1 | 03-06-2013 | 31 | 85 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 655 | 1975 | Graduation | Divorced | 153924.0 | 0 | 0 | 07-02-2014 | 81 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 687 | 1982 | PhD | Married | 160803.0 | 0 | 0 | 04-08-2012 | 21 | 55 | 16 | ... | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1300 | 1971 | Master | Together | 157733.0 | 1 | 0 | 04-06-2013 | 37 | 39 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1653 | 1977 | Graduation | Together | 157146.0 | 0 | 0 | 29-04-2013 | 13 | 1 | 0 | ... | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2132 | 1949 | PhD | Married | 156924.0 | 0 | 0 | 29-08-2013 | 85 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2233 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | 14 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 rows × 26 columns
# Only one value is truly extreme: an income of 666,666. The coincidental run of 6s suggests a bad entry, so we remove that row.
df = df[df['Income'] != 666666.]
print("Row with income at 666,666.0 dropped successfully.")
# The remaining outliers are not very extreme, but there are so few of them that we drop them too.
df = df[df['Income'] < income_lvl]
print("Rows with income above the upper whisker dropped successfully.")
Row with income at 666,666.0 dropped successfully.
Rows with income above the upper whisker dropped successfully.
# Continue exploring numerical features with more detailed statistics
for feature in numerical_features:
# Skewness
print(f"Skewness of {feature}: {df[feature].skew()}")
print("-" * 30)
Skewness of Year_Birth: -0.3527245697160486
------------------------------
Skewness of Income: 0.004845021990835766
------------------------------
Skewness of Recency: -0.0030723804940621124
------------------------------
Skewness of MntFishProducts: 1.912855852859832
------------------------------
Skewness of MntMeatProducts: 1.8150293359162661
------------------------------
Skewness of MntFruits: 2.0982547491943238
------------------------------
Skewness of MntSweetProducts: 2.0636987685700747
------------------------------
Skewness of MntWines: 1.1650323432442231
------------------------------
Skewness of MntGoldProds: 1.8328376708426588
------------------------------
Skewness of NumDealsPurchases: 2.3254914645573184
------------------------------
Skewness of NumCatalogPurchases: 1.3664475612507232
------------------------------
Skewness of NumStorePurchases: 0.7089275674031636
------------------------------
Skewness of NumWebPurchases: 0.9956790673284797
------------------------------
Skewness of NumWebVisitsMonth: 0.23431698182902091
------------------------------
Observations:¶
- Most features are positively (right) skewed.
- Year of birth is strongly negatively skewed: the bulk of customers are relatively young, with a long tail of older birth years. On closer inspection, some of those extreme years look like bad data.
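As a quick reminder of how the sign of skewness reads, a minimal illustration on made-up values (not this dataset): a long left tail produces negative skew, a long right tail produces positive skew.

```python
import pandas as pd

# A few very small values (long left tail) -> negative skew,
# analogous to the handful of 1890s birth years among mostly 1970s births
left_tailed = pd.Series([1900, 1893] + [1970, 1975, 1980] * 20)

# A few very large values (long right tail) -> positive skew,
# analogous to most of the spending columns
right_tailed = pd.Series([0, 5, 10] * 20 + [500, 800])

print(left_tailed.skew())   # negative
print(right_tailed.skew())  # positive
```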
# Now display year of birth outliers and remove rows with those outliers
# Identify outliers in 'Year_Birth'
Q1_year = df['Year_Birth'].quantile(0.25)
Q3_year = df['Year_Birth'].quantile(0.75)
IQR_year = Q3_year - Q1_year
lower_bound_year = Q1_year - 1.5 * IQR_year
upper_bound_year = Q3_year + 1.5 * IQR_year
outliers_year = df[(df['Year_Birth'] < lower_bound_year) | (df['Year_Birth'] > upper_bound_year)]
print("Year of birth outliers:")
print(outliers_year.T)
# Remove rows with 'Year_Birth' outliers
df = df[(df['Year_Birth'] >= lower_bound_year) & (df['Year_Birth'] <= upper_bound_year)]
print("\nRows with 'Year_Birth' outliers removed.")
df.shape
Year of birth outliers:
192 239 339
Year_Birth 1900 1893 1899
Education Master Master PhD
Marital_Status Divorced Single Together
Income 36640.0 60182.0 83532.0
Kidhome 1 0 0
Teenhome 0 1 0
Dt_Customer 26-09-2013 17-05-2014 26-09-2013
Recency 99 23 36
MntWines 15 8 755
MntFruits 6 0 144
MntMeatProducts 8 5 562
MntFishProducts 7 7 104
MntSweetProducts 4 0 64
MntGoldProds 25 2 224
NumDealsPurchases 1 1 1
NumWebPurchases 2 1 4
NumCatalogPurchases 1 0 6
NumStorePurchases 2 2 4
NumWebVisitsMonth 5 4 1
AcceptedCmp3 0 0 0
AcceptedCmp4 0 0 0
AcceptedCmp5 0 0 1
AcceptedCmp1 0 0 0
AcceptedCmp2 0 0 0
Complain 1 0 0
Response 0 0 0
Rows with 'Year_Birth' outliers removed.
(2197, 26)
# Display rows where 'NumWebPurchases' > 'NumWebVisitsMonth'
flawed_data = df[df['NumWebPurchases'] > df['NumWebVisitsMonth']]
print("Rows where 'NumWebPurchases' > 'NumWebVisitsMonth':")
print(flawed_data.T)
# There are some rows where the number of web purchases exceeds the number of web visits.
# At first glance this seems flawed, but these could be subscriptions or automatic purchases.
Rows where 'NumWebPurchases' > 'NumWebVisitsMonth':
0 2 6 12 \
Year_Birth 1957 1965 1971 1959
Education Graduation Graduation Graduation Graduation
Marital_Status Single Together Divorced Divorced
Income 58138.0 71613.0 55635.0 63033.0
Kidhome 0 0 0 0
Teenhome 0 0 1 0
Dt_Customer 04-09-2012 21-08-2013 13-11-2012 15-11-2013
Recency 58 26 34 82
MntWines 635 426 235 194
MntFruits 88 49 65 61
MntMeatProducts 546 127 164 480
MntFishProducts 172 111 50 225
MntSweetProducts 88 21 49 112
MntGoldProds 88 42 27 30
NumDealsPurchases 3 1 4 1
NumWebPurchases 8 8 7 3
NumCatalogPurchases 10 2 3 4
NumStorePurchases 4 10 7 8
NumWebVisitsMonth 7 4 6 2
AcceptedCmp3 0 0 0 0
AcceptedCmp4 0 0 0 0
AcceptedCmp5 0 0 0 0
AcceptedCmp1 0 0 0 0
AcceptedCmp2 0 0 0 0
Complain 0 0 0 0
Response 1 0 0 0
15 18 23 29 \
Year_Birth 1946 1949 1954 1965
Education PhD Master PhD PhD
Marital_Status Single Married Married Married
Income 82800.0 76995.0 65324.0 84618.0
Kidhome 0 0 0 0
Teenhome 0 1 1 0
Dt_Customer 24-11-2012 28-03-2013 11-01-2014 22-11-2013
Recency 23 91 0 96
MntWines 1006 1012 384 684
MntFruits 22 80 0 100
MntMeatProducts 115 498 102 801
MntFishProducts 59 0 21 21
MntSweetProducts 68 16 32 66
MntGoldProds 45 176 5 0
NumDealsPurchases 1 2 3 1
NumWebPurchases 7 11 6 6
NumCatalogPurchases 6 4 2 9
NumStorePurchases 12 9 9 10
NumWebVisitsMonth 3 5 4 2
AcceptedCmp3 0 0 0 0
AcceptedCmp4 0 0 0 0
AcceptedCmp5 1 0 0 1
AcceptedCmp1 1 1 0 0
AcceptedCmp2 0 0 0 0
Complain 0 0 0 0
Response 1 0 0 0
36 39 ... 2213 2217 \
Year_Birth 1970 1943 ... 1995 1948
Education PhD PhD ... Master PhD
Marital_Status Married Divorced ... Single Widow
Income 67353.0 48948.0 ... 80617.0 82032.0
Kidhome 0 0 ... 0 0
Teenhome 1 0 ... 0 0
Dt_Customer 31-12-2013 01-02-2013 ... 12-10-2012 05-04-2014
Recency 37 53 ... 42 54
MntWines 702 437 ... 594 332
MntFruits 17 8 ... 51 194
MntMeatProducts 151 206 ... 631 377
MntFishProducts 0 160 ... 72 149
MntSweetProducts 8 49 ... 55 125
MntGoldProds 35 42 ... 32 57
NumDealsPurchases 5 2 ... 1 0
NumWebPurchases 5 7 ... 4 4
NumCatalogPurchases 6 10 ... 6 6
NumStorePurchases 12 5 ... 8 7
NumWebVisitsMonth 2 6 ... 2 1
AcceptedCmp3 0 1 ... 0 0
AcceptedCmp4 0 0 ... 0 0
AcceptedCmp5 0 0 ... 0 0
AcceptedCmp1 0 0 ... 0 0
AcceptedCmp2 0 0 ... 0 0
Complain 0 0 ... 0 0
Response 0 1 ... 0 0
2221 2224 2225 2226 \
Year_Birth 1982 1970 1968 1979
Education Master PhD Graduation Graduation
Marital_Status Single Single Together Married
Income 75777.0 57642.0 58554.0 63777.0
Kidhome 0 0 1 1
Teenhome 0 1 1 1
Dt_Customer 04-07-2013 09-01-2014 26-09-2012 24-03-2013
Recency 12 24 55 87
MntWines 712 580 368 457
MntFruits 26 6 24 5
MntMeatProducts 538 58 68 106
MntFishProducts 69 8 38 15
MntSweetProducts 13 0 0 17
MntGoldProds 80 27 88 53
NumDealsPurchases 1 3 6 8
NumWebPurchases 3 7 8 11
NumCatalogPurchases 6 6 2 1
NumStorePurchases 11 6 6 6
NumWebVisitsMonth 1 4 7 8
AcceptedCmp3 0 1 0 0
AcceptedCmp4 1 0 0 0
AcceptedCmp5 1 0 0 0
AcceptedCmp1 0 0 0 0
AcceptedCmp2 0 0 0 0
Complain 0 0 0 0
Response 1 0 0 0
2231 2235 2236 2238
Year_Birth 1970 1967 1946 1956
Education Master Graduation PhD Master
Marital_Status Single Married Together Together
Income 44802.0 61223.0 64014.0 69245.0
Kidhome 0 0 2 0
Teenhome 0 1 1 1
Dt_Customer 21-08-2012 13-06-2013 10-06-2014 24-01-2014
Recency 71 46 56 8
MntWines 853 709 406 428
MntFruits 10 43 0 30
MntMeatProducts 143 182 30 214
MntFishProducts 13 42 0 80
MntSweetProducts 10 118 0 30
MntGoldProds 20 247 8 61
NumDealsPurchases 2 2 7 2
NumWebPurchases 9 9 8 6
NumCatalogPurchases 4 3 2 5
NumStorePurchases 12 4 5 10
NumWebVisitsMonth 8 5 7 3
AcceptedCmp3 0 0 0 0
AcceptedCmp4 0 0 0 0
AcceptedCmp5 0 0 0 0
AcceptedCmp1 0 0 1 0
AcceptedCmp2 0 0 0 0
Complain 0 0 0 0
Response 0 0 0 0
[26 rows x 860 columns]
Bivariate Analysis¶
- Analyze different categorical and numerical variables and check how different variables are related to each other.
- Check the relationship of numerical variables with categorical variables.
# Analyze the relationship between 'Income' and 'MntWines'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Income', y='MntWines', data=df)
plt.title('Income vs. Amount Spent on Wine')
plt.xlabel('Income')
plt.ylabel('Amount Spent on Wine')
plt.show()
# Analyze the relationship between 'Income' and 'MntMeatProducts'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Income', y='MntMeatProducts', data=df)
plt.title('Income vs. Amount Spent on Meat Products')
plt.xlabel('Income')
plt.ylabel('Amount Spent on Meat Products')
plt.show()
# Analyze the relationship between 'Income' and 'MntFruits'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Income', y='MntFruits', data=df)
plt.title('Income vs. Amount Spent on Fruits')
plt.xlabel('Income')
plt.ylabel('Amount Spent on Fruits')
plt.show()
# Analyze the relationship between 'Income' and 'Education'
plt.figure(figsize=(8,6))
sns.boxplot(x='Education', y='Income', data=df)
plt.title('Income vs Education')
plt.show()
# Analyze the relationship between categorical variables and 'Response'
for col in ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']:
plt.figure(figsize=(10, 6))
sns.countplot(x=col, hue='Response', data=df)
plt.title(f'Response Rate by {col}')
plt.xticks(rotation=45)
plt.show()
# Further bivariate analysis using other suitable plots like pairplot, heatmap etc.
# Example using heatmap for correlation matrix
plt.figure(figsize=(24, 20))
# Calculate the correlation matrix only for numerical features
numerical_df = df.select_dtypes(include=np.number) # Select only numerical columns
sns.heatmap(numerical_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Observations from Bivariate Analysis:¶
Income vs. Spending on Products:¶
- There's a positive correlation between income and the amount spent on wine, meat products, and to a lesser extent, fruits. Higher-income customers tend to spend more on these items. This suggests that income is a significant factor in purchasing behavior for these product categories.
- The relationship doesn't appear perfectly linear; there's some scatter, indicating other factors might influence spending beyond just income.
Income vs. Education:¶
- Income distribution varies across different education levels. "PhD" and "Master" levels seem to have higher income distributions, while others overlap more significantly. This suggests an influence of education level on income.
- "Basic" correlates with substantially less income.
Response Rate by Categorical Variables:¶
- Education: Response rates seem relatively consistent across education levels, suggesting education level might not be a strong predictor of campaign acceptance.
- Marital Status: There are subtle differences in response rates among different marital statuses, but no single status overwhelmingly accepts or rejects campaigns. More investigation might be needed.
- Kidhome/Teenhome: Customers with children or teenagers in their households might demonstrate different response patterns to campaigns. Further investigation into the variations could be worthwhile.
Correlation Matrix:¶
- The heatmap provides a visual overview of the correlations between numerical features. Strong positive correlations exist between spending on different product categories (e.g., wine, meat, gold products), indicating that customers who spend heavily in one category tend to spend more in others. This suggests that they might be higher value customers overall.
- Some other correlations are expected, for example, the positive correlation between number of web visits and web purchases makes sense.
Additional insights and potential further analysis:¶
- Segmentation: The bivariate analysis hints at potential customer segments based on income, education, and spending habits. Further segmentation analysis could reveal distinct customer groups that may respond differently to marketing campaigns.
- Interaction Effects: There may be potential interaction effects between variables (e.g., the relationship between income and wine spending might be different for various education levels).
- Outlier Impact: Assess how the removal of outliers affected the bivariate relationships. Some correlations could have changed because of outlier removal.
- Visualizations: More sophisticated visualizations (e.g., grouped boxplots, violin plots) could reveal additional trends and relationships between the variables.
- Statistical Tests: Statistical tests could be performed to determine the significance of correlations and differences in response rates between groups.
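As an example of the statistical tests suggested above, a chi-square test of independence could check whether response rate really is unrelated to education. A minimal numpy sketch; the contingency table below is illustrative, not computed from this dataset, and the p-value step is only noted in a comment:

```python
import numpy as np

# Illustrative contingency table (NOT from this dataset):
# rows = education levels, columns = Response (0 / 1)
table = np.array([
    [950, 164],  # Graduation
    [480,  83],  # Master
    [400,  78],  # PhD
    [ 50,   4],  # Basic
], dtype=float)

# Expected counts under independence, then the chi-square statistic
row_tot = table.sum(axis=1, keepdims=True)
col_tot = table.sum(axis=0, keepdims=True)
expected = row_tot * col_tot / table.sum()
chi2 = ((table - expected) ** 2 / expected).sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)
print(f"chi2 = {chi2:.2f} with {dof} degrees of freedom")
# The p-value would come from e.g. scipy.stats.chi2.sf(chi2, dof)
```

A small chi-square statistic relative to its degrees of freedom (large p-value) would support the observation that education is not a strong predictor of campaign acceptance.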
Feature Engineering and Data Processing¶
In this section, we will first prepare our dataset for analysis.
- Imputing missing values
Think About It:
- Can we extract the age of each customer and create a new feature?
- Can we find the total kids and teens in the home?
- Can we find out how many members each family has?
- Can we find the total amount spent by the customers on various products?
- Can we find out how long the customer has been with the company?
- Can we find out how many offers the customers have accepted?
- Can we find out amount spent per purchase?
# New Total Family size field
df['Tot_Family_Mem'] = 1 # Initialize with 1 for the individual
# Add 1 for a spouse if married (partners with status 'Together' are not counted here)
df.loc[df['Marital_Status'] == 'Married', 'Tot_Family_Mem'] += 1
# Add the number of kids and teens
df['Tot_Family_Mem'] = df['Tot_Family_Mem'] + df['Kidhome'] + df['Teenhome']
# Calculate customer tenure
# Convert 'Dt_Customer' to datetime objects; the format must be given explicitly since the dates are day-first
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
# Calculate the difference between a reference date and the customer's enrollment date
reference_date = datetime.datetime(2025, 1, 1)
df['Customer_Tenure'] = (reference_date - df['Dt_Customer']).dt.days
# Now 'Customer_Tenure' contains the number of days each customer has been with the company
print(df[['Dt_Customer', 'Customer_Tenure']].head(10))
  Dt_Customer  Customer_Tenure
0  2012-09-04             4502
1  2014-03-08             3952
2  2013-08-21             4151
3  2014-02-10             3978
4  2014-01-19             4000
5  2013-09-09             4132
6  2012-11-13             4432
7  2013-05-08             4256
8  2013-06-06             4227
9  2014-03-13             3947
# Calculate the total number of accepted campaigns for each customer
df['TotalAcceptedCmp'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']
# Calculate the age of each customer
current_year = 2025
df['Age'] = current_year - df['Year_Birth']
# Total kids and teens at home
df['Total_Kids'] = df['Kidhome'] + df['Teenhome']
# Total amount spent on products
df['Total_Spending'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
# Total purchases
df['Total_Num_Purchases'] = df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases']
# Drop rows where exactly one of total purchases or total spending is zero (XOR): if only purchases were zero,
# the division below would blow up, while rows where both are zero yield NaN and are handled afterwards
df = df[~((df['Total_Num_Purchases'] == 0) ^ (df['Total_Spending'] == 0))]
# Amount spent per purchase
df['Amount_Spent_Per_Purchase'] = df['Total_Spending'] / df['Total_Num_Purchases']
df['Amount_Spent_Per_Purchase'] = df['Amount_Spent_Per_Purchase'].fillna(0)  # handle the 0/0 (both totals zero) case
# Display the updated DataFrame
print(df.head(10).T)
0 1 \
Year_Birth 1957 1954
Education Graduation Graduation
Marital_Status Single Single
Income 58138.0 46344.0
Kidhome 0 1
Teenhome 0 1
Dt_Customer 2012-09-04 00:00:00 2014-03-08 00:00:00
Recency 58 38
MntWines 635 11
MntFruits 88 1
MntMeatProducts 546 6
MntFishProducts 172 2
MntSweetProducts 88 1
MntGoldProds 88 6
NumDealsPurchases 3 2
NumWebPurchases 8 1
NumCatalogPurchases 10 1
NumStorePurchases 4 2
NumWebVisitsMonth 7 5
AcceptedCmp3 0 0
AcceptedCmp4 0 0
AcceptedCmp5 0 0
AcceptedCmp1 0 0
AcceptedCmp2 0 0
Complain 0 0
Response 1 0
Tot_Family_Mem 1 3
Customer_Tenure 4502 3952
TotalAcceptedCmp 0 0
Age 68 71
Total_Kids 0 2
Total_Spending 1617 27
Total_Num_Purchases 22 4
Amount_Spent_Per_Purchase 73.5 6.75
2 3 \
Year_Birth 1965 1984
Education Graduation Graduation
Marital_Status Together Together
Income 71613.0 26646.0
Kidhome 0 1
Teenhome 0 0
Dt_Customer 2013-08-21 00:00:00 2014-02-10 00:00:00
Recency 26 26
MntWines 426 11
MntFruits 49 4
MntMeatProducts 127 20
MntFishProducts 111 10
MntSweetProducts 21 3
MntGoldProds 42 5
NumDealsPurchases 1 2
NumWebPurchases 8 2
NumCatalogPurchases 2 0
NumStorePurchases 10 4
NumWebVisitsMonth 4 6
AcceptedCmp3 0 0
AcceptedCmp4 0 0
AcceptedCmp5 0 0
AcceptedCmp1 0 0
AcceptedCmp2 0 0
Complain 0 0
Response 0 0
Tot_Family_Mem 1 2
Customer_Tenure 4151 3978
TotalAcceptedCmp 0 0
Age 60 41
Total_Kids 0 1
Total_Spending 776 53
Total_Num_Purchases 20 6
Amount_Spent_Per_Purchase 38.8 8.833333
4 5 \
Year_Birth 1981 1967
Education PhD Master
Marital_Status Married Together
Income 58293.0 62513.0
Kidhome 1 0
Teenhome 0 1
Dt_Customer 2014-01-19 00:00:00 2013-09-09 00:00:00
Recency 94 16
MntWines 173 520
MntFruits 43 42
MntMeatProducts 118 98
MntFishProducts 46 0
MntSweetProducts 27 42
MntGoldProds 15 14
NumDealsPurchases 5 2
NumWebPurchases 5 6
NumCatalogPurchases 3 4
NumStorePurchases 6 10
NumWebVisitsMonth 5 6
AcceptedCmp3 0 0
AcceptedCmp4 0 0
AcceptedCmp5 0 0
AcceptedCmp1 0 0
AcceptedCmp2 0 0
Complain 0 0
Response 0 0
Tot_Family_Mem 3 2
Customer_Tenure 4000 4132
TotalAcceptedCmp 0 0
Age 44 58
Total_Kids 1 1
Total_Spending 422 716
Total_Num_Purchases 14 20
Amount_Spent_Per_Purchase 30.142857 35.8
6 7 \
Year_Birth 1971 1985
Education Graduation PhD
Marital_Status Divorced Married
Income 55635.0 33454.0
Kidhome 0 1
Teenhome 1 0
Dt_Customer 2012-11-13 00:00:00 2013-05-08 00:00:00
Recency 34 32
MntWines 235 76
MntFruits 65 10
MntMeatProducts 164 56
MntFishProducts 50 3
MntSweetProducts 49 1
MntGoldProds 27 23
NumDealsPurchases 4 2
NumWebPurchases 7 4
NumCatalogPurchases 3 0
NumStorePurchases 7 4
NumWebVisitsMonth 6 8
AcceptedCmp3 0 0
AcceptedCmp4 0 0
AcceptedCmp5 0 0
AcceptedCmp1 0 0
AcceptedCmp2 0 0
Complain 0 0
Response 0 0
Tot_Family_Mem 2 3
Customer_Tenure 4432 4256
TotalAcceptedCmp 0 0
Age 54 40
Total_Kids 1 1
Total_Spending 590 169
Total_Num_Purchases 17 8
Amount_Spent_Per_Purchase 34.705882 21.125
8 9
Year_Birth 1974 1950
Education PhD PhD
Marital_Status Together Together
Income 30351.0 5648.0
Kidhome 1 1
Teenhome 0 1
Dt_Customer 2013-06-06 00:00:00 2014-03-13 00:00:00
Recency 19 68
MntWines 14 28
MntFruits 0 0
MntMeatProducts 24 6
MntFishProducts 3 1
MntSweetProducts 3 1
MntGoldProds 2 13
NumDealsPurchases 1 1
NumWebPurchases 3 1
NumCatalogPurchases 0 0
NumStorePurchases 2 0
NumWebVisitsMonth 9 20
AcceptedCmp3 0 1
AcceptedCmp4 0 0
AcceptedCmp5 0 0
AcceptedCmp1 0 0
AcceptedCmp2 0 0
Complain 0 0
Response 1 0
Tot_Family_Mem 2 3
Customer_Tenure 4227 3947
TotalAcceptedCmp 0 1
Age 51 75
Total_Kids 1 2
Total_Spending 46 49
Total_Num_Purchases 5 1
Amount_Spent_Per_Purchase 9.2 49.0
# Create a new column 'Marital_Status_Single'
df['Single'] = 0
# Set the value to 1 if Marital_Status is 'Single', 'Divorced', or 'Widow'
df.loc[df['Marital_Status'].isin(['Single', 'Divorced', 'Widow']), 'Single'] = 1
# Display the updated DataFrame with the new column
print(df[['Marital_Status', 'Single']].head(20))
   Marital_Status  Single
0          Single       1
1          Single       1
2        Together       0
3        Together       0
4         Married       0
5        Together       0
6        Divorced       1
7         Married       0
8        Together       0
9        Together       0
11        Married       0
12       Divorced       1
13       Divorced       1
14        Married       0
15         Single       1
16        Married       0
17       Together       0
18        Married       0
19         Single       1
20        Married       0
df["Well_Educated"] = (df["Education"] != "Basic").astype(int)
print(df[['Well_Educated',"Education"]].head(20))
    Well_Educated   Education
0               1  Graduation
1               1  Graduation
2               1  Graduation
3               1  Graduation
4               1         PhD
5               1      Master
6               1  Graduation
7               1         PhD
8               1         PhD
9               1         PhD
11              0       Basic
12              1  Graduation
13              1      Master
14              1  Graduation
15              1         PhD
16              1  Graduation
17              1  Graduation
18              1      Master
19              1      Master
20              1  Graduation
Important Insights from EDA and Data Preprocessing¶
What are the most important observations and insights from the data based on the EDA and data preprocessing performed?
Data Cleaning:¶
- Missing 'Income' values were dropped due to their low count and the difficulty of accurate imputation.
- Inconsistent or nonsensical entries in 'Marital_Status' (e.g., "Alone", "Absurd", "YOLO") were removed as they represented a small portion of the data and likely held no meaningful information.
- The '2n Cycle' value in 'Education' was replaced with 'Master' for consistency.
- Extreme outliers were observed in 'Income' and were dropped given their small number.
- Outliers in 'Year_Birth' were identified and removed, potentially due to data entry errors.
- Rows where the number of Web Purchases was greater than Web Visits were retained, as these could be explained by auto purchases or subscriptions.
Data Distribution:¶
- Most numerical features exhibited positive skew, common for non-negative values with a concentration near zero.
- 'Recency' had a somewhat uniform distribution.
- 'Year_Birth' showed a strong negative skew and was cleaned to remove outliers.
Feature Relationships:¶
- Positive correlation was found between income and the amount spent on wines, meat products, and fruits.
- 'Education' level seemed to influence 'Income' distribution, with higher income levels associated with Master and PhD degrees and much lower income level associated with Basic education.
- The number of web visits and purchases correlated strongly.
- Correlation between spending on different product categories was observed (suggesting high-value customers).
- Response rates showed subtle variations across different categories like Marital Status and presence of kids/teens.
Feature Engineering:¶
- New features were created, enhancing the data's analytical value:
- 'Tot_Family_Mem': Total family members
- 'Customer_Tenure': Length of time as a customer
- 'TotalAcceptedCmp': Total number of offers accepted
- 'Age': Customer age
- 'Total_Kids': Total number of children and teenagers
- 'Total_Spending': Total expenditure on all products
- 'Total_Num_Purchases': Sum of purchases across different channels
- 'Amount_Spent_Per_Purchase': Spending per purchase.
Overall, the data processing involved addressing outliers, handling inconsistent values, and engineering useful features to better understand customer behavior. Further investigation could reveal more nuanced insights.
Data Preparation for Segmentation¶
- The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
- Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
- Plot the correlation matrix after we've removed the irrelevant variables
- Scale the Data
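The segmentation-vs-profiling split described above can be made explicit in code. The groupings below are one illustrative choice mirroring the behavioral/demographic distinction; they are not necessarily the exact feature set used in the cells that follow:

```python
# Behavioral attributes drive the clustering itself...
segmentation_attrs = ['Total_Spending', 'Total_Num_Purchases',
                      'Amount_Spent_Per_Purchase', 'Income']

# ...while demographics are held back to describe the clusters afterwards
profiling_attrs = ['Age', 'Education', 'Marital_Status',
                   'Tot_Family_Mem', 'Total_Kids']

# Each attribute should play exactly one role
overlap = set(segmentation_attrs) & set(profiling_attrs)
print(f"Overlapping attributes: {overlap or 'none'}")
```

Keeping the two lists disjoint makes it easy to cluster on one frame (`df[segmentation_attrs]`) and later describe each cluster with the other.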
df.describe().T
| count | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2193.0 | 1968.901961 | 1940.0 | 1959.0 | 1970.0 | 1977.0 | 1996.0 | 11.700839 |
| Income | 2193.0 | 51676.200182 | 2447.0 | 35246.0 | 51369.0 | 68316.0 | 105471.0 | 20607.121475 |
| Kidhome | 2193.0 | 0.442316 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.53736 |
| Teenhome | 2193.0 | 0.507068 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.544607 |
| Dt_Customer | 2193 | 2013-07-09 17:47:01.887824896 | 2012-07-30 00:00:00 | 2013-01-16 00:00:00 | 2013-07-08 00:00:00 | 2013-12-30 00:00:00 | 2014-06-29 00:00:00 | NaN |
| Recency | 2193.0 | 49.103055 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 | 28.939992 |
| MntWines | 2193.0 | 306.962152 | 0.0 | 24.0 | 178.0 | 508.0 | 1493.0 | 337.93087 |
| MntFruits | 2193.0 | 26.4601 | 0.0 | 2.0 | 8.0 | 33.0 | 199.0 | 39.81754 |
| MntMeatProducts | 2193.0 | 165.845873 | 0.0 | 16.0 | 68.0 | 233.0 | 1725.0 | 218.031444 |
| MntFishProducts | 2193.0 | 37.75969 | 0.0 | 3.0 | 12.0 | 50.0 | 259.0 | 54.690012 |
| MntSweetProducts | 2193.0 | 27.116279 | 0.0 | 1.0 | 8.0 | 34.0 | 198.0 | 40.896642 |
| MntGoldProds | 2193.0 | 44.033744 | 0.0 | 9.0 | 25.0 | 57.0 | 321.0 | 51.59191 |
| NumDealsPurchases | 2193.0 | 2.305974 | 0.0 | 1.0 | 2.0 | 3.0 | 15.0 | 1.846739 |
| NumWebPurchases | 2193.0 | 4.094391 | 0.0 | 2.0 | 4.0 | 6.0 | 25.0 | 2.688807 |
| NumCatalogPurchases | 2193.0 | 2.651163 | 0.0 | 0.0 | 2.0 | 4.0 | 28.0 | 2.797136 |
| NumStorePurchases | 2193.0 | 5.838577 | 0.0 | 3.0 | 5.0 | 8.0 | 13.0 | 3.237332 |
| NumWebVisitsMonth | 2193.0 | 5.317829 | 0.0 | 3.0 | 6.0 | 7.0 | 20.0 | 2.360668 |
| AcceptedCmp3 | 2193.0 | 0.073871 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.261621 |
| AcceptedCmp4 | 2193.0 | 0.074783 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.263102 |
| AcceptedCmp5 | 2193.0 | 0.072959 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.260129 |
| AcceptedCmp1 | 2193.0 | 0.064295 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.245335 |
| AcceptedCmp2 | 2193.0 | 0.013224 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.114258 |
| Complain | 2193.0 | 0.00912 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.095084 |
| Response | 2193.0 | 0.150479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.357622 |
| Tot_Family_Mem | 2193.0 | 2.338349 | 1.0 | 2.0 | 2.0 | 3.0 | 5.0 | 0.900868 |
| Customer_Tenure | 2193.0 | 4193.259006 | 3839.0 | 4020.0 | 4195.0 | 4368.0 | 4538.0 | 202.135423 |
| TotalAcceptedCmp | 2193.0 | 0.299134 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.67872 |
| Age | 2193.0 | 56.098039 | 29.0 | 48.0 | 55.0 | 66.0 | 85.0 | 11.700839 |
| Total_Kids | 2193.0 | 0.949384 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.748899 |
| Total_Spending | 2193.0 | 608.177839 | 8.0 | 69.0 | 397.0 | 1048.0 | 2525.0 | 602.227874 |
| Total_Num_Purchases | 2193.0 | 12.584131 | 1.0 | 6.0 | 12.0 | 18.0 | 32.0 | 7.166577 |
| Amount_Spent_Per_Purchase | 2193.0 | 37.498756 | 2.666667 | 13.0 | 29.846154 | 49.125 | 187.666667 | 30.065162 |
| Single | 2193.0 | 0.352029 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.477712 |
| Well_Educated | 2193.0 | 0.975376 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.155011 |
# Define the features to be used for segmentation
segmentation_features = ['Total_Spending', 'Total_Num_Purchases', 'Amount_Spent_Per_Purchase', 'Total_Kids', 'Single', 'Well_Educated', "Income"]
# Create a new DataFrame with only the segmentation features
segmentation_df = df[segmentation_features]
# Plot the correlation matrix for the segmentation features
plt.figure(figsize=(12, 10))
sns.heatmap(segmentation_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Segmentation Features')
plt.show()
# Scale the data using StandardScaler
scaler = StandardScaler()
scaled_segmentation_df = scaler.fit_transform(segmentation_df)
# Convert the scaled data back to a DataFrame
scaled_segmentation_df = pd.DataFrame(scaled_segmentation_df, columns=segmentation_features)
# Display the first few rows of the scaled DataFrame
print(scaled_segmentation_df.head())
Total_Spending Total_Num_Purchases Amount_Spent_Per_Purchase Total_Kids \
0 1.675532 1.314158 1.197714 -1.267996
1 -0.965266 -1.198074 -1.022970 1.403201
2 0.278732 1.035021 0.043291 -1.267996
3 -0.922084 -0.918937 -0.953661 0.067602
4 -0.309219 0.197611 -0.244721 0.067602
Single Well_Educated Income
0 1.356714 0.158888 0.313643
1 1.356714 0.158888 -0.258814
2 -0.737075 0.158888 0.967692
3 -0.737075 0.158888 -1.214915
4 -0.737075 0.158888 0.321166
Applying t-SNE and PCA to visualize the data in 2 dimensions¶
Applying T-SNE¶
# Apply t-SNE to the scaled data
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(scaled_segmentation_df)
# Create a DataFrame for the t-SNE results
tsne_df = pd.DataFrame(data=tsne_result, columns=['TSNE1', 'TSNE2'])
# Plot the t-SNE results
plt.figure(figsize=(10, 8))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'])
plt.title('t-SNE Visualization of Customer Segments')
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.show()
Observation and Insights:
- A few possible clusters:
- Top middle
- Center middle
- Center right
- Bottom left
- Bottom middle
- and two on the bottom right.
Applying PCA¶
Think about it:
- Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?
When PCA helps:
High dimensionality: The data has many features. PCA reduces dimensionality by finding principal components that capture the most variance in the data. Clustering in a lower-dimensional space is often faster and can reveal clearer cluster structures since noise and irrelevant features are reduced.
Correlated features: If the features are highly correlated (as indicated by the correlation heatmap), PCA can decorrelate them. Clustering algorithms can struggle with correlated features, leading to less meaningful clusters.
When PCA might not help (or could hurt):
Interpretability: Principal components are linear combinations of the original features and are often harder to interpret. If understanding the meaning of each cluster in terms of the original features is important, PCA might obfuscate the insights. In this case, the original features are relatively interpretable (spending, age, tenure, etc.) while the PCs are not as directly meaningful.
Loss of information: PCA discards some information by reducing dimensionality. If the discarded variance contains important clustering information, PCA could harm the cluster quality.
In this specific case:
t-SNE has already been applied above, but only for visualization. t-SNE is non-linear and well suited to plotting, not to serving as a preprocessing step for clustering, so chaining PCA onto it before clustering likely offers little additional benefit. With only seven moderately correlated features, there is no clear need to apply PCA at all in this analysis; it is more useful as a secondary 2-D visualization, and even then t-SNE's own two-component view is arguably sufficient.
What to do instead:
- Try clustering directly on the scaled data (scaled_segmentation_df): this is a good baseline.
- If computation time is an issue, consider PCA: if clustering on the seven features of the scaled data is slow, reducing dimensions with PCA first is reasonable. In that case, try different numbers of components (e.g., 3, 4, 5) to find a balance between speed and information retention.
Carefully evaluate results: Compare clustering solutions on the original scaled data and the PCA-transformed data. Use metrics like silhouette scores or visual inspection of cluster separations. Choose the solution that produces the most meaningful and stable clusters.
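The comparison described above can be sketched as follows. This is a minimal, hypothetical example: the toy array `X` stands in for `scaled_segmentation_df`, and k=4 is assumed purely for illustration.

```python
# Hedged sketch: compare K-Means on the scaled features vs. on PCA-reduced
# features. `X` is random stand-in data (7 columns, like the 7 segmentation
# features); in the notebook, substitute scaled_segmentation_df.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 7))  # stand-in for the scaled data

def silhouette_for(data, k=4):
    """Cluster with K-Means and return the silhouette score in that space."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data)
    return silhouette_score(data, labels)

score_raw = silhouette_for(X)          # baseline: cluster the scaled data directly
pca_scores = {}
for n in (3, 4, 5):                    # candidate component counts from the text
    X_pca = PCA(n_components=n, random_state=42).fit_transform(X)
    pca_scores[n] = silhouette_for(X_pca)
print(score_raw, pca_scores)
```

Because the silhouette is computed in whichever space the clustering ran in, scores from different spaces are only roughly comparable; visual inspection of the resulting clusters remains important.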
# Apply PCA to the scaled data
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_segmentation_df)
# Create a DataFrame for the PCA results
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])
# Plot the PCA results
plt.figure(figsize=(10, 8))
plt.scatter(pca_df['PC1'], pca_df['PC2'])
plt.title('PCA Visualization of Customer Segments')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Observation and Insights:
No clear pattern is detected visually. A dividing line could run through PC2 = 0, angled slightly downward, suggesting 2 clusters with stripes.
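One way to investigate the stripes is to look at the PCA loadings: discrete features (Single, Well_Educated, and Total_Kids take only a few values) tend to produce banded scatter plots, and the loadings show how strongly each original feature drives each component. A minimal sketch with toy data standing in for `scaled_segmentation_df` (continuous columns for the spending/income features, 0/1 columns for the discrete ones):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

features = ['Total_Spending', 'Total_Num_Purchases', 'Amount_Spent_Per_Purchase',
            'Total_Kids', 'Single', 'Well_Educated', 'Income']
rng = np.random.default_rng(0)
cont = lambda: rng.normal(size=500)                    # continuous stand-ins
binr = lambda: rng.integers(0, 2, 500).astype(float)   # discrete stand-ins
X = np.column_stack([cont(), cont(), cont(), binr(), binr(), binr(), cont()])
X = (X - X.mean(axis=0)) / X.std(axis=0)               # as StandardScaler would

pca = PCA(n_components=2).fit(X)
loadings = pd.DataFrame(pca.components_.T, index=features, columns=['PC1', 'PC2'])
print(loadings.round(2))  # large loadings on discrete features would explain the bands
```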
K-Means¶
Think About It:
- How do we determine the optimal K value from the elbow curve?
- Which metric can be used to determine the final K value?
# Determine optimal K value using the elbow method and silhouette score
inertia = []
silhouette_avg = []
K = range(2, 11) # Test K values from 2 to 10
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_segmentation_df)
    inertia.append(kmeans.inertia_)
    silhouette_avg.append(silhouette_score(scaled_segmentation_df, kmeans.labels_))
# Plot the elbow curve
plt.figure(figsize=(10, 5))
plt.plot(K, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()
# Plot the silhouette scores
plt.figure(figsize=(10, 5))
plt.plot(K, silhouette_avg, marker='o')
plt.title('Silhouette Score for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()
# Apply K-means with the chosen optimal k
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(scaled_segmentation_df)
df['Cluster'] = kmeans.labels_
Observations:
- Since the silhouette scores are all low, the elbow method is the better guide here.
- There is a clear elbow at k=4.
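The elbow choice can be made less subjective by picking the k where the inertia curve bends most sharply, i.e., where the second difference of the inertia values peaks. A sketch with illustrative inertia values (the real notebook would reuse the `inertia` list computed above):

```python
import numpy as np

K = list(range(2, 11))
# Illustrative inertia values shaped like a typical elbow curve (not the
# notebook's actual numbers)
inertia = [12000, 9000, 7000, 6200, 5800, 5500, 5300, 5150, 5050]

# The second difference measures how sharply the curve bends; its peak marks
# the elbow. Index i of the second difference corresponds to K[i + 1].
accel = np.diff(inertia, 2)
elbow_k = K[int(np.argmax(accel)) + 1]
print(f"Elbow at k = {elbow_k}")  # -> Elbow at k = 4 for these toy values
```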
Applying KMeans on the PCA data and visualize the clusters¶
# Apply K-means to the PCA data
kmeans_pca = KMeans(n_clusters=optimal_k, random_state=42)
kmeans_pca.fit(pca_df)
# Add cluster labels to the PCA DataFrame
pca_df['Cluster'] = kmeans_pca.labels_
# Visualize the clusters on the PCA plot
plt.figure(figsize=(10, 8))
for cluster in range(optimal_k):
    plt.scatter(pca_df[pca_df['Cluster'] == cluster]['PC1'],
                pca_df[pca_df['Cluster'] == cluster]['PC2'],
                label=f'Cluster {cluster}')
plt.title('K-Means Clustering on PCA Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
PCA does not show the clusters very well. Without colors, a human eye would only detect two groupings: the lower left (currently green) and the rest, which is multi-colored. The algorithm may be picking up on the stripes, which would explain why it settles on 4 clusters.
Cluster Profiling¶
# Calculate cluster profiles with more features
cluster_profiles = df.groupby('Cluster').agg({
'Total_Spending': 'mean',
'Total_Num_Purchases': 'mean',
'Amount_Spent_Per_Purchase': 'mean',
'Customer_Tenure': 'mean',
'Age': 'mean',
'Total_Kids': 'mean',
'Single': 'mean',
'Well_Educated': 'mean',
'Income': 'mean',
'NumDealsPurchases': 'mean',
'NumWebPurchases': 'mean',
'NumCatalogPurchases': 'mean',
'NumStorePurchases': 'mean',
'NumWebVisitsMonth': 'mean',
'Response': 'mean',
'Complain': 'mean',
'Recency': 'mean',
'Tot_Family_Mem': 'mean'
}).reset_index()
# Get the number of features to plot
num_features = len(cluster_profiles.columns) - 1 # Subtract 1 to exclude 'Cluster' column
# Calculate the number of rows and columns for subplots
num_rows = int(np.ceil(num_features / 3)) # 3 columns per row
num_cols = 3
# Create the figure and subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5)) # Adjust figsize as needed
# Flatten the axes array for easier iteration
axes = axes.flatten()
# Iterate through features and create bar plots
for i, feature in enumerate(cluster_profiles.columns[1:]):  # skip the 'Cluster' column
    sns.barplot(x='Cluster', y=feature, data=cluster_profiles, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster')
# Hide any unused subplots
for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
Observations and Insights: The means of some features, such as Age and Customer Tenure, are not very distinct between clusters, which indicates either that those features are not very useful for segmentation or that a different k value would produce better separation on them.
Many features, however, such as Total Spending and Income, differentiate strongly between clusters.
Describe the characteristics of each cluster¶
Think About It:
- Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using?
- What is the next step to get more meaningful insights?
Summary of each cluster: Cluster 0:
- Lower Total Spending, Income, and Amount Spent Per Purchase.
- Fewer Total Purchases, indicating less frequent engagement.
- Slightly higher proportion of single individuals.
- Lower education levels.
- Preferentially uses deals and promotions.
- More likely to complain
- Lower recency
Cluster 1:
- Highest Total Spending and Income.
- High Amount Spent Per Purchase, suggesting larger purchases.
- Higher number of total purchases, indicating more frequent customer activity.
- Relatively balanced family composition and education levels.
- Moderate use of all channels, with a slight preference for web purchases.
- Less likely to complain
- Relatively high recency
Cluster 2:
- Moderate Total Spending and Income.
- Moderate Amount Spent Per Purchase and Total Purchases.
- Slightly higher proportion of families.
- Higher education levels
- Moderate use of web and catalog channels.
- Lower recency
- Less likely to complain
Cluster 3:
- Lower Total Spending and Income compared to Cluster 1.
- Moderate Amount Spent Per Purchase.
- Moderate number of total purchases.
- Higher proportion of families with kids/teens.
- Lower proportion of single people.
- Lower education level
- Moderate recency
- Less likely to complain
Possible Next Steps for Deeper Insights:
- RFM Analysis: Combine Recency, Frequency, and Monetary Value to create more robust segments.
- Channel Preferences: Analyze the preferred purchase channels (web, catalog, store) within each cluster to refine targeting strategies.
- Product Affinity: Investigate which product categories are most popular within each cluster, and use this to create targeted offers or promotions.
- Campaign Effectiveness: Evaluate the success of past marketing campaigns within each cluster.
- Customer Lifetime Value: Estimate the potential lifetime value of customers in each cluster.
- External Data Integration: Combine customer data with external sources like demographics and economic indicators.
- Qualitative Research: Conduct surveys or interviews with customers from each cluster to gather deeper insights into their motivations and behaviors.
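The RFM next step could be sketched like this: score Recency, Frequency (here Total_Num_Purchases), and Monetary value (Total_Spending) into quartiles and combine them into a single segment code. Toy data stands in for `df`; the column names follow this notebook, but the quartile scheme is one common convention, not the only one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df_toy = pd.DataFrame({
    'Recency': rng.integers(0, 100, 200),
    'Total_Num_Purchases': rng.integers(1, 33, 200),
    'Total_Spending': rng.integers(8, 2526, 200),
})

def qscore(s, reverse=False):
    """Quartile score 1-4; ranking first guarantees distinct bin edges."""
    labels = [4, 3, 2, 1] if reverse else [1, 2, 3, 4]
    return pd.qcut(s.rank(method='first'), 4, labels=labels).astype(int)

df_toy['R'] = qscore(df_toy['Recency'], reverse=True)   # low recency = recent = good
df_toy['F'] = qscore(df_toy['Total_Num_Purchases'])
df_toy['M'] = qscore(df_toy['Total_Spending'])
df_toy['RFM'] = df_toy['R'].astype(str) + df_toy['F'].astype(str) + df_toy['M'].astype(str)
print(df_toy['RFM'].value_counts().head())  # e.g. '444' = most recent, frequent, high-spend
```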
K-Medoids¶
# Determine optimal K value using the elbow method and silhouette score for K-Medoids
inertia_kmedoids = []
silhouette_avg_kmedoids = []
K = range(2, 11) # Test K values from 2 to 10
for k in K:
    kmedoids = KMedoids(n_clusters=k, random_state=42)
    kmedoids.fit(scaled_segmentation_df)
    inertia_kmedoids.append(kmedoids.inertia_)
    silhouette_avg_kmedoids.append(silhouette_score(scaled_segmentation_df, kmedoids.labels_))
# Plot the elbow curve for K-Medoids
plt.figure(figsize=(10, 5))
plt.plot(K, inertia_kmedoids, marker='o')
plt.title('Elbow Method for Optimal k (K-Medoids)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()
# Plot the silhouette scores for K-Medoids
plt.figure(figsize=(10, 5))
plt.plot(K, silhouette_avg_kmedoids, marker='o')
plt.title('Silhouette Score for Optimal k (K-Medoids)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()
# Apply K-Medoids with the chosen optimal k
optimal_k_kmedoids = 7 # clear elbow at 7
kmedoids = KMedoids(n_clusters=optimal_k_kmedoids, random_state=42)
kmedoids.fit(scaled_segmentation_df)
df['Cluster_KMedoids'] = kmedoids.labels_
Visualize the clusters using PCA¶
# Apply K-Medoids with PCA first
pca = PCA(n_components=2, random_state=42) # Use PCA for dimensionality reduction
pca_result = pca.fit_transform(scaled_segmentation_df)
optimal_k_kmedoids = 7 # Use the previously determined optimal k
kmedoids_pca = KMedoids(n_clusters=optimal_k_kmedoids, random_state=42)
kmedoids_pca.fit(pca_result)
# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in range(optimal_k_kmedoids):
    plt.scatter(pca_result[kmedoids_pca.labels_ == cluster, 0],
                pca_result[kmedoids_pca.labels_ == cluster, 1],
                label=f'Cluster {cluster}')
plt.title('K-Medoids Clustering with PCA Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
Cluster Profiling¶
# Calculate cluster profiles with more features for K-Medoids
cluster_profiles_kmedoids = df.groupby('Cluster_KMedoids').agg({
'Total_Spending': 'mean',
'Total_Num_Purchases': 'mean',
'Amount_Spent_Per_Purchase': 'mean',
'Customer_Tenure': 'mean',
'Age': 'mean',
'Total_Kids': 'mean',
'Single': 'mean',
'Well_Educated': 'mean',
'Income': 'mean',
'NumDealsPurchases': 'mean',
'NumWebPurchases': 'mean',
'NumCatalogPurchases': 'mean',
'NumStorePurchases': 'mean',
'NumWebVisitsMonth': 'mean',
'Response': 'mean',
'Complain': 'mean',
'Recency': 'mean',
'Tot_Family_Mem': 'mean'
}).reset_index()
# Get the number of features to plot
num_features = len(cluster_profiles_kmedoids.columns) - 1
# Calculate the number of rows and columns for subplots
num_rows = int(np.ceil(num_features / 3))
num_cols = 3
# Create the figure and subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()
# Iterate through features and create bar plots
for i, feature in enumerate(cluster_profiles_kmedoids.columns[1:]):  # skip the cluster column
    sns.barplot(x='Cluster_KMedoids', y=feature, data=cluster_profiles_kmedoids, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (K-Medoids)')
# Hide any unused subplots
for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
Observations and Insights:
K-Medoids Optimal k: The elbow method for K-Medoids suggests an optimal k value of 7, as indicated by a clear elbow point in the inertia plot at k=7. The silhouette scores also support this, showing relatively better separation at that point.
K-Medoids Cluster Visualization (PCA): The PCA visualization of K-Medoids clusters shows some separation between groups, but it's not as distinct as the t-SNE visualizations performed earlier. PCA, by its nature, aims to capture the most variance, not necessarily to optimize for visual separation of clusters. It's important to note that using PCA as a visualization tool may not accurately reflect the true clustering structure, especially when dealing with non-linear relationships in the data. This is again a point in favor of using t-SNE or other similar approaches for visualization if the goal is to view cluster separation graphically.
K-Medoids Cluster Profiling: The cluster profiles generated by K-Medoids provide insights into the characteristics of each cluster. By analyzing the mean values of various features across clusters, one can observe the differences between customer segments. For example, one might find one cluster with higher average spending and income, another with frequent small purchases, and others with distinct channel preferences. These cluster means provide evidence for whether the clusters are meaningful and reveal the characteristics associated with them. The results are very similar to K-Means: Customer Tenure, Age, and Recency do not have much of an effect on the overall result. Well_Educated also has little impact with K-Medoids, although it does with K-Means.
Characteristics of each cluster¶
Summary for each cluster:
Observations and Insights:
K-Medoids Cluster Analysis: Characteristics of 7 Clusters
Cluster 0:
- Summary: This cluster represents customers with moderate spending, income, and purchase frequency. They exhibit a balanced approach across various purchasing channels and show a moderate level of engagement with the brand. They are moderately likely to complain, but their recency is low.
- Characteristics: Average spending and income, moderate purchase frequency, balanced channel usage, and likely less engaged with recent promotions or marketing efforts. They show a moderate level of response to marketing.
- Observations and Insights: This segment represents a typical customer base, and targeted campaigns could focus on increasing engagement and average order value.
Cluster 1:
- Summary: Customers in this cluster exhibit high total spending and income, along with a high amount spent per purchase, showing a tendency toward larger purchases. They are the highest spending cluster and the most engaged overall. They are the least likely to complain.
- Characteristics: High-value customers with significant purchasing power, likely loyal to the brand, and responsive to premium offerings. They exhibit a high recency.
- Observations and Insights: Focus on retention strategies, exclusive offers, and premium services to cater to their high-value nature.
Cluster 2:
- Summary: This group demonstrates the lowest average spending and income of all clusters. Their frequency of purchases is also low, and they're highly price-sensitive.
- Characteristics: Lower spending, lower income, low purchase frequency, and a potential inclination towards promotions and deals. Their recency is low. They are less likely to respond to marketing efforts.
- Observations and Insights: Focus on attracting and retaining these customers with budget-friendly offers and promotions aimed at improving their recency.
Cluster 3:
- Summary: This cluster represents customers with moderate spending and income, similar to cluster 0, but they show a higher propensity for online purchases.
- Characteristics: Moderate spending, income, and purchase frequency, but with a strong preference for online channels.
- Observations and Insights: Target these customers with online-focused promotions, personalized recommendations, and enhanced digital experience.
Cluster 4:
- Summary: Characterized by relatively low spending and purchase frequency, these customers exhibit a preference for catalog purchases.
- Characteristics: Low spending and frequency, but a distinctive preference for catalog purchases.
- Observations and Insights: Catalogs may be a primary communication channel for this group. Further investigation might be worthwhile.
Cluster 5:
- Summary: This cluster has a balance of characteristics. They are moderately frequent purchasers with a propensity to make higher-value purchases than some other clusters.
- Characteristics: Moderate spending, moderate-to-high income, relatively high amount spent per purchase, and a balance across purchase channels.
- Observations and Insights: These are potentially valuable customers, deserving more attention and potentially higher-value offers to encourage continued loyalty and spending.
Cluster 6:
- Summary: This segment displays a high level of recent engagement. They may be new customers, or those who have recently become more active.
- Characteristics: High recency but with a wide range of spending and income levels.
- Observations and Insights: Explore their recent engagement and tailor offers based on individual spending habits and potential. They could be highly receptive to promotional offers.
Hierarchical Clustering¶
- Find the Cophenetic correlation for different distances with different linkage methods.
- Create the dendrograms for different linkages
- Explore different linkages with each distance metric
# Define distance metrics and linkage methods
distance_metrics = ['euclidean', 'cosine', 'cityblock'] # Example distances
linkage_methods = ['ward', 'complete', 'average', 'single'] # Example linkages
for distance_metric in distance_metrics:
    for linkage_method in linkage_methods:
        # Calculate pairwise distances
        distance_matrix = pdist(scaled_segmentation_df, metric=distance_metric)
        # Apply hierarchical clustering
        # (note: 'ward' linkage formally assumes Euclidean distances, so its
        # results with cosine/cityblock should be treated cautiously)
        Z = sch.linkage(distance_matrix, method=linkage_method)
        # Calculate cophenetic correlation
        c, coph_dists = cophenet(Z, distance_matrix)
        print(f"Distance metric: {distance_metric}, Linkage method: {linkage_method}, Cophenetic correlation: {c}")
        # Create dendrogram
        plt.figure(figsize=(10, 7))
        sch.dendrogram(Z)
        plt.title(f"Dendrogram - Distance: {distance_metric}, Linkage: {linkage_method}")
        plt.xlabel("Data Points")
        plt.ylabel(f"{distance_metric.capitalize()} Distance")  # label matches the metric in use
        plt.show()
Distance metric: euclidean, Linkage method: ward, Cophenetic correlation: 0.6123351772311272
Distance metric: euclidean, Linkage method: complete, Cophenetic correlation: 0.7554965204185778
Distance metric: euclidean, Linkage method: average, Cophenetic correlation: 0.8115901083822973
Distance metric: euclidean, Linkage method: single, Cophenetic correlation: 0.6645719278052516
Distance metric: cosine, Linkage method: ward, Cophenetic correlation: 0.790770108442397
Distance metric: cosine, Linkage method: complete, Cophenetic correlation: 0.6142869145599397
Distance metric: cosine, Linkage method: average, Cophenetic correlation: 0.8228621339320061
Distance metric: cosine, Linkage method: single, Cophenetic correlation: 0.12150149904066151
Distance metric: cityblock, Linkage method: ward, Cophenetic correlation: 0.6609966974251326
Distance metric: cityblock, Linkage method: complete, Cophenetic correlation: 0.6885602896156096
Distance metric: cityblock, Linkage method: average, Cophenetic correlation: 0.7383057177388052
Distance metric: cityblock, Linkage method: single, Cophenetic correlation: 0.5360541365004085
Think about it:
- Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
No, deciding the number of clusters solely based on a visual inspection of the dendrogram's horizontal cut can be subjective. While a dendrogram helps visualize the hierarchical relationships between data points, an optimal horizontal cut isn't always obvious. Different cuts will lead to different numbers of clusters. There is no single, universally correct place to cut the dendrogram.
- What is the next step in obtaining number of clusters based on the dendrogram?
The next step is to use a more rigorous approach than simply visual inspection. Here are some methods:
- Inconsistent Method: This method looks at the height of the fusion in the dendrogram and tries to find an appropriate cutoff based on inconsistencies in the heights. This is useful for determining significant clusters.
- Height Cutoff: Define a threshold for the distance or height on the dendrogram where one wants to make the cut. This value is arbitrary, but one might base it on domain expertise, or look for large jumps in heights in the dendrogram.
- Analyze Cophenetic Correlation: The cophenetic correlation coefficient measures how faithfully the dendrogram preserves the original pairwise distances between data points. A higher cophenetic correlation coefficient suggests a better representation of the data, and might indirectly give one an idea of the number of clusters. However, it's not a direct measure of the optimal number of clusters.
- Silhouette Score or other evaluation metrics: After clustering based on cutting the dendrogram, apply traditional methods to assess the number of clusters.
These approaches can provide more objective guidance.
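The last approach, cutting the dendrogram at several candidate cluster counts and then scoring each cut, can be sketched as below. Toy blobs stand in for the scaled data; the real notebook would reuse `Z` and `scaled_segmentation_df` instead.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.metrics import silhouette_score

# Three well-separated toy blobs, so the "right" answer is known to be 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

Z = sch.linkage(X, method='ward')
scores = {}
for k in range(2, 7):
    labels = sch.fcluster(Z, t=k, criterion='maxclust')  # cut into k clusters
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k, scores)  # best_k should be 3 for these blobs
```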
- Are there any distinct clusters in any of the dendrograms?
- Each dendrogram shows tall vertical branches joined near the top by both wide and narrow horizontal links.
- Some dendrograms are less distinct than others, such as the complete linkages and some of the average linkages.
Visualize the clusters using PCA¶
# Apply PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(scaled_segmentation_df)
# Reuse 'Z' from the hierarchical clustering loop above (note: it holds the last combination computed, cityblock/single)
# Get cluster labels using fcluster, specifying the desired number of clusters (e.g., 4)
labels = fcluster(Z, t=4, criterion='maxclust')
# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in np.unique(labels):  # iterate over the unique cluster labels
    plt.scatter(pca_result[labels == cluster, 0], pca_result[labels == cluster, 1], label=f'Cluster {cluster}')
plt.title('Hierarchical Clustering with PCA Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
Cluster Profiling¶
# Calculate cluster profiles for hierarchical clustering
cluster_profiles_hierarchical = df.groupby(labels).agg({
'Total_Spending': 'mean',
'Total_Num_Purchases': 'mean',
'Amount_Spent_Per_Purchase': 'mean',
'Customer_Tenure': 'mean',
'Age': 'mean',
'Total_Kids': 'mean',
'Single': 'mean',
'Well_Educated': 'mean',
'Income': 'mean',
'NumDealsPurchases': 'mean',
'NumWebPurchases': 'mean',
'NumCatalogPurchases': 'mean',
'NumStorePurchases': 'mean',
'NumWebVisitsMonth': 'mean',
'Response': 'mean',
'Complain': 'mean',
'Recency': 'mean',
'Tot_Family_Mem': 'mean'
}).reset_index()
# Plotting cluster profiles for hierarchical clustering
num_features = len(cluster_profiles_hierarchical.columns) - 1
num_rows = int(np.ceil(num_features / 3))
num_cols = 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(cluster_profiles_hierarchical.columns[1:]):
    sns.barplot(x='index', y=feature, data=cluster_profiles_hierarchical, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (Hierarchical)')
# Hide any unused subplots
for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
Observations and Insights:
- Cluster 3 (green) contains most of the data points, followed by cluster 1 (blue); clusters 4 and 2 hold what appear to be only two and one data points, respectively.
- Feature Importance: The consistent observation across all three clustering methods (K-Means, K-Medoids, Hierarchical) that Customer Tenure, Age, and Recency have minimal impact suggests that these factors might not be strong drivers of customer segmentation in this dataset. This implies that other variables like spending habits, purchase frequency, and other preferences may be more important for defining customer segments.
- Only cluster 3 has complaints, and likewise it has data points in every column.
- Hierarchical clustering does not capture the clusters very well here. Ideally only clusters 1 and 3 would be needed to capture the structure; two clusters work better than four in this case.
Characteristics of each cluster¶
Summary of each cluster:
Characteristics of 4 Clusters (Based on provided visualization and profiling)
Cluster 1:
- Summary: This cluster likely represents a large portion of the customer base, exhibiting a wide range of characteristics but possibly with a slightly higher than average spending and total purchases. It seems to be the most common profile. They also seem to be most likely to complain.
- Characteristics: Moderate to high spending, moderate purchase frequency, varied channel usage.
Cluster 2:
- Summary: This cluster appears to contain very few customers and is relatively distinct.
- Characteristics: Low frequency of purchases and spending, with the caveat that the small sample size limits drawing strong conclusions.
Cluster 3:
- Summary: This cluster shows characteristics similar to the largest cluster but with potentially lower spending, total purchases, and a different channel preference. Possibly representing customers who are less engaged or prefer alternative purchase channels.
- Characteristics: Moderate spending and frequency, possibly with a stronger preference for specific purchase channels.
Cluster 4:
- Summary: Another small cluster, potentially representing another niche group of customers.
- Characteristics: Difficult to determine clear characteristics due to the limited number of data points.
DBSCAN¶
DBSCAN is a very powerful algorithm for finding high-density clusters, but the difficulty is choosing its two hyperparameters, eps and min_samples.
Since it is an unsupervised algorithm, there is no validation set to tune against as there would be in supervised learning. The practical approach is to try a range of hyperparameter combinations and compute the silhouette score for each.
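A common complement to the grid search below is the k-distance heuristic for eps: sort every point's distance to its min_samples-th nearest neighbor and read eps off the knee of the resulting curve. A minimal sketch (toy data stands in for `scaled_segmentation_df`; in the notebook one would `plt.plot(k_dist)` and inspect the knee visually):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))  # stand-in for the scaled features

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)     # columns: 0th (the point itself) .. (min_samples-1)th neighbor
k_dist = np.sort(distances[:, -1])  # sorted distance to the min_samples-th neighbor
print(k_dist[0], k_dist[-1])        # the knee of this curve suggests a value for eps
```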
# Define the parameter grid to search
eps_values = [0.1, 0.5, 1, 2, 5] # Example eps values
min_samples_values = [2, 5, 10, 20] # Example min_samples values
best_score = -1
best_eps = None
best_min_samples = None
for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(scaled_segmentation_df)
        # Only score runs that found more than one label (ignore all-noise or single-cluster runs)
        if len(set(labels)) > 1:
            score = silhouette_score(scaled_segmentation_df, labels)
            if score > best_score:
                best_score = score
                best_eps = eps
                best_min_samples = min_samples
            print(f"eps: {eps}, min_samples: {min_samples}, silhouette score: {score}")
print(f"\nBest parameters: eps={best_eps}, min_samples={best_min_samples}, silhouette score={best_score}")
# Fit DBSCAN with the best hyperparameters
best_dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
best_labels = best_dbscan.fit_predict(scaled_segmentation_df)
#Now one can use 'best_labels' for further analysis like cluster profiling
eps: 0.1, min_samples: 2, silhouette score: -0.14111668583144424
eps: 0.1, min_samples: 5, silhouette score: -0.4536308807195296
eps: 0.1, min_samples: 10, silhouette score: -0.251366827044772
eps: 0.5, min_samples: 2, silhouette score: 0.052830037298867406
eps: 0.5, min_samples: 5, silhouette score: 0.13500354391650107
eps: 0.5, min_samples: 10, silhouette score: 0.0811168552184625
eps: 0.5, min_samples: 20, silhouette score: -0.009236040212927215
eps: 1, min_samples: 2, silhouette score: 0.11084030466103999
eps: 1, min_samples: 5, silhouette score: 0.14564400885419426
eps: 1, min_samples: 10, silhouette score: 0.10289200663344249
eps: 1, min_samples: 20, silhouette score: 0.13132599165460845
eps: 2, min_samples: 2, silhouette score: 0.24181438952337633
eps: 2, min_samples: 5, silhouette score: 0.24159549640464362
eps: 2, min_samples: 10, silhouette score: 0.24159549640464362
eps: 2, min_samples: 20, silhouette score: 0.24155740743355267
eps: 5, min_samples: 2, silhouette score: 0.5750308184494759
eps: 5, min_samples: 5, silhouette score: 0.5750308184494759
eps: 5, min_samples: 10, silhouette score: 0.5750308184494759
eps: 5, min_samples: 20, silhouette score: 0.5750308184494759
Best parameters: eps=5, min_samples=2, silhouette score=0.5750308184494759
Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶
# Apply PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(scaled_segmentation_df)
# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in np.unique(best_labels):
    plt.scatter(pca_result[best_labels == cluster, 0], pca_result[best_labels == cluster, 1], label=f'Cluster {cluster}')
plt.title('DBSCAN Clustering with PCA Visualization (Best Hyperparameters)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
Observations and Insights:
- Two clusters seem to be the best overall clustering option. Visually, there are two clearly distinct separations. Even though the elbow test didn't reveal it for K-Means and K-Medoids, their silhouette scores peaked at k=2, and now DBSCAN agrees.
- Visually there are also about 7 horizontal lines/stripes of data points, so 7 clusters could plausibly work too, albeit difficult to recover programmatically. There are 5 stripes in cluster 0 (blue) and 2 in cluster 1 (orange), but 2 clusters still seems clearest.
Think about it:
- Will changing the eps and min_samples values produce different DBSCAN results? Can we try more values for eps and min_samples?
Trying more values does not help here: as the expanded search below shows, the silhouette score maxes out at 0.5750308184494759, the same value reached above. The original selection of eps and min_samples already showed silhouette scores trending upward as both parameters increased, and adding intermediate values does not change the best score. It only means the maximum is reached at a smaller eps (about 3.45 instead of 5), while the score itself stays the same.
# Expanding the parameter grid for DBSCAN
eps_values = np.linspace(0.1, 5, 20)           # more granular eps values
min_samples_values = [2, 3, 4, 5, 10, 15, 20]  # more min_samples values

best_score = -1
best_eps = None
best_min_samples = None

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(scaled_segmentation_df)
        # Check if more than one cluster is found
        if len(set(labels)) > 1:  # ignore cases with only noise or one cluster
            score = silhouette_score(scaled_segmentation_df, labels)
            if score > best_score:
                best_score = score
                best_eps = eps
                best_min_samples = min_samples
            print(f"eps: {eps}, min_samples: {min_samples}, silhouette score: {score}")

print(f"\nBest parameters: eps={best_eps}, min_samples={best_min_samples}, silhouette score={best_score}")
Silhouette scores for the expanded DBSCAN grid search (rows: eps, rounded to 3 d.p.; columns: min_samples; — where fewer than two clusters were found):

| eps | 2 | 3 | 4 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|---|
| 0.100 | -0.141 | -0.411 | -0.405 | -0.454 | -0.251 | — | — |
| 0.358 | -0.015 | 0.016 | 0.046 | 0.015 | -0.024 | -0.062 | -0.120 |
| 0.616 | 0.024 | 0.075 | 0.076 | 0.101 | 0.160 | 0.120 | 0.074 |
| 0.874 | 0.089 | 0.095 | 0.104 | 0.104 | 0.085 | 0.126 | 0.156 |
| 1.132 | 0.112 | 0.121 | 0.146 | 0.146 | 0.142 | 0.134 | 0.126 |
| 1.389 | 0.213 | 0.241 | 0.241 | 0.241 | 0.240 | 0.240 | 0.233 |
| 1.647 | 0.232 | 0.242 | 0.242 | 0.242 | 0.242 | 0.242 | 0.242 |
| 1.905 | 0.232 | 0.242 | 0.242 | 0.242 | 0.242 | 0.242 | 0.241 |
| 2.163 | 0.320 | 0.410 | 0.410 | 0.410 | 0.410 | 0.410 | 0.410 |
| 2.421 | 0.349 | 0.438 | 0.438 | 0.438 | 0.438 | 0.438 | 0.438 |
| 2.679 | 0.379 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 |
| 2.937 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 |
| 3.195 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 | 0.469 |
| 3.453 | 0.575 | 0.575 | 0.575 | 0.502 | 0.502 | 0.502 | 0.502 |
| 3.711 | 0.575 | 0.575 | 0.575 | 0.575 | 0.502 | 0.502 | 0.502 |
| 3.968 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.502 | 0.502 |
| 4.226 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 |
| 4.484 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 |
| 4.742 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 |
| 5.000 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 | 0.575 |

Best parameters: eps=3.452631578947369, min_samples=2, silhouette score=0.5750308184494759
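The plateau behaviour above can be reproduced on toy data: once eps exceeds the within-cluster spread but stays below the inter-cluster gap, DBSCAN keeps producing the same labelling, so the silhouette score freezes. A minimal sketch on synthetic blobs (not the project's data; the eps range is chosen to sit inside that window):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs standing in for the real feature matrix
X, _ = make_blobs(n_samples=200, centers=[(0, 0), (10, 10)],
                  cluster_std=0.5, random_state=42)

scores = {}
for eps in [2, 3, 4, 5]:  # larger than the within-blob spread,
                          # smaller than the gap between the blobs
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    if len(set(labels)) > 1:
        scores[eps] = silhouette_score(X, labels)

print(scores)  # identical score for every eps in this range
```

Once eps grows past the inter-cluster gap, everything merges into a single cluster and the silhouette score is no longer defined, which is why the grid search guards on `len(set(labels)) > 1`.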
Characteristics of each cluster¶
Summary of each cluster:
# Calculate cluster profiles for DBSCAN clustering
cluster_profiles_dbscan = df.groupby(best_labels).agg({
    'Total_Spending': 'mean',
    'Total_Num_Purchases': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Customer_Tenure': 'mean',
    'Age': 'mean',
    'Total_Kids': 'mean',
    'Single': 'mean',
    'Well_Educated': 'mean',
    'Income': 'mean',
    'NumDealsPurchases': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'Response': 'mean',
    'Complain': 'mean',
    'Recency': 'mean',
    'Tot_Family_Mem': 'mean'
}).reset_index()
# Plotting cluster profiles for DBSCAN clustering
num_features = len(cluster_profiles_dbscan.columns) - 1
num_rows = int(np.ceil(num_features / 3))
num_cols = 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(cluster_profiles_dbscan.columns[1:]):
    sns.barplot(x='index', y=feature, data=cluster_profiles_dbscan, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (DBSCAN)')
for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
# Characteristics of each cluster (DBSCAN)
# best_labels contains two clusters (0 and 1)
print("Characteristics of Clusters (DBSCAN):")

# Cluster 0
print("\nCluster 0:")
print(cluster_profiles_dbscan.iloc[0])

# Cluster 1
print("\nCluster 1:")
print(cluster_profiles_dbscan.iloc[1])
Characteristics of Clusters (DBSCAN):

Cluster 0:

| 0 | |
|---|---|
| index | 0.000000 |
| Total_Spending | 621.466573 |
| Total_Num_Purchases | 12.769986 |
| Amount_Spent_Per_Purchase | 38.120923 |
| Customer_Tenure | 4191.392239 |
| Age | 56.314165 |
| Total_Kids | 0.955119 |
| Single | 0.351566 |
| Well_Educated | 1.000000 |
| Income | 52468.148200 |
| NumDealsPurchases | 2.318841 |
| NumWebPurchases | 4.150070 |
| NumCatalogPurchases | 2.705937 |
| NumStorePurchases | 5.913978 |
| NumWebVisitsMonth | 5.278635 |
| Response | 0.153343 |
| Complain | 0.009350 |
| Recency | 49.119682 |
| Tot_Family_Mem | 2.344554 |

Cluster 1:
| 1 | |
|---|---|
| index | 1.000000 |
| Total_Spending | 81.796296 |
| Total_Num_Purchases | 5.222222 |
| Amount_Spent_Per_Purchase | 12.854030 |
| Customer_Tenure | 4267.203704 |
| Age | 47.537037 |
| Total_Kids | 0.722222 |
| Single | 0.370370 |
| Well_Educated | 0.000000 |
| Income | 20306.259259 |
| NumDealsPurchases | 1.796296 |
| NumWebPurchases | 1.888889 |
| NumCatalogPurchases | 0.481481 |
| NumStorePurchases | 2.851852 |
| NumWebVisitsMonth | 6.870370 |
| Response | 0.037037 |
| Complain | 0.000000 |
| Recency | 48.444444 |
| Tot_Family_Mem | 2.092593 |
DBSCAN Cluster 0: This cluster exhibits a higher average total spending, total number of purchases, and amount spent per purchase compared to Cluster 1. Members of this cluster also tend to have a higher income and more web purchases. They also show a slightly higher propensity to complain.
DBSCAN Cluster 1: This cluster shows lower average spending, total purchases, and amount spent per purchase. Cluster members in this group tend to have lower incomes and fewer web purchases than those in Cluster 0. They also appear less likely to complain.
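A compact alternative to printing each profile separately is to put the cluster profiles side by side, transposing the profile table so features become rows and clusters become columns. A minimal sketch on toy data (the columns below are stand-ins for the notebook's full feature set):

```python
import pandas as pd

# Toy stand-ins for the notebook's df and best_labels
df = pd.DataFrame({
    'Total_Spending': [600, 650, 80, 90],
    'Income': [50000, 55000, 20000, 21000],
})
best_labels = [0, 0, 1, 1]

# Mean profile per cluster, then transpose: features as rows, clusters as columns
profiles = df.groupby(best_labels).mean().T
print(profiles)
```

This makes feature-by-feature comparisons (for example Total_Spending in cluster 0 vs. cluster 1) readable in a single table.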
Gaussian Mixture Model¶
Visualize the clusters using PCA¶
Cluster Profiling¶
# Find the optimal number of components for GMM using BIC
n_components_range = range(2, 11)
lowest_bic = np.inf  # np.infty is deprecated in recent NumPy
bic = []
for n_components in n_components_range:
    # Fit a Gaussian mixture with n_components components
    gmm = GaussianMixture(n_components=n_components, random_state=42)
    gmm.fit(scaled_segmentation_df)
    bic.append(gmm.bic(scaled_segmentation_df))
    if bic[-1] < lowest_bic:
        lowest_bic = bic[-1]
        best_gmm = gmm
best_n_components = bic.index(lowest_bic) + 2  # adjusted index: range starts at 2
print(f"Optimal number of components: {best_n_components}")

# Fit the GMM model with the optimal number of components
gmm = GaussianMixture(n_components=best_n_components, random_state=42)
gmm_labels = gmm.fit_predict(scaled_segmentation_df)
# Now use 'gmm_labels' for further analysis such as cluster profiling
# Apply GMM for the best hyperparameters and visualize the clusters from PCA

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(scaled_segmentation_df)

# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in np.unique(gmm_labels):
    plt.scatter(pca_result[gmm_labels == cluster, 0],
                pca_result[gmm_labels == cluster, 1],
                label=f'Cluster {cluster}')
plt.title('GMM Clustering with PCA Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
# Calculate cluster profiles for GMM clustering
cluster_profiles_gmm = df.groupby(gmm_labels).agg({
    'Total_Spending': 'mean',
    'Total_Num_Purchases': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Customer_Tenure': 'mean',
    'Age': 'mean',
    'Total_Kids': 'mean',
    'Single': 'mean',
    'Well_Educated': 'mean',
    'Income': 'mean',
    'NumDealsPurchases': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'Response': 'mean',
    'Complain': 'mean',
    'Recency': 'mean',
    'Tot_Family_Mem': 'mean'
}).reset_index()
# Plotting cluster profiles for GMM clustering
num_features = len(cluster_profiles_gmm.columns) - 1
num_rows = int(np.ceil(num_features / 3))
num_cols = 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(cluster_profiles_gmm.columns[1:]):
    sns.barplot(x='index', y=feature, data=cluster_profiles_gmm, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (GMM)')
for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
Optimal number of components: 10
Observations and Insights: The Gaussian Mixture Model (GMM) identified 10 as the optimal number of components based on the Bayesian Information Criterion (BIC). The PCA visualization shows 10 clusters that roughly follow the stripes noted earlier. Cluster profiles reveal key differences between these segments:
- The usual variables such as Age, Tenure, and Recency show little separation between segments.
- Total Family Members shows minimal separation, as in the previous clustering runs.
- The remaining features separate the segments strongly, and many of them are correlated with one another.
- DBSCAN still takes the win for best clustering. GMM is next in line, since it captures those stripes to some degree.
Further analysis of the cluster profiles should provide actionable insights into the characteristics of these customer segments. Further feature engineering may be required to capture the stripes explicitly.
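The stripes are consistent with one or more count-valued (near-discrete) features dominating a principal component: a standardized integer feature produces evenly spaced bands in PCA space. A quick way to flag candidate features is to count distinct values per column. The data, column names, and the threshold of 10 below are all illustrative assumptions, not the project's actual features:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: one continuous column, one count-valued column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Income': rng.normal(50000, 15000, 100),         # effectively continuous
    'Total_Num_Purchases': rng.integers(0, 7, 100),  # at most 7 distinct values
})

# Columns with few distinct values are the likely source of PCA "stripes";
# the cutoff of 10 distinct values is an arbitrary illustrative choice
distinct_counts = df.nunique()
candidates = distinct_counts[distinct_counts <= 10].index.tolist()
print(candidates)  # → ['Total_Num_Purchases']
```

Flagged columns could then be binned, one-hot encoded, or down-weighted before PCA to test whether the stripes disappear.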
Characteristics of each cluster¶
Summary of each cluster: Cluster 0: Likely a moderate spending group, with average purchase frequency and income.
Cluster 1: Potentially a high-value customer segment with above-average spending and purchase frequency. May be more responsive to marketing campaigns and prefer specific channels.
Cluster 2: Could represent a lower spending group, potentially less engaged with the brand. Demographics and purchasing channel preferences could reveal differences in their behavior.
Cluster 3: Likely another moderate-spending group, but with a distinct profile from Cluster 0. A comparison of these two clusters across features would reveal specific differences.
Cluster 4: Possibly a niche segment with unique characteristics. Further investigation into its demographics, spending patterns, and preferences, potentially with more data, would be needed to characterize it fully.
Cluster 5: May be similar to Cluster 2 or 3, but with subtle differences. A side-by-side comparison would help.
Cluster 6: Could represent another high-value segment, potentially distinct from Cluster 1 in specific purchasing behavior or demographics.
Cluster 7: Might represent a segment with a low frequency of purchases but moderate or high spending per transaction. Further investigation would help.
Cluster 8: Potentially a small and unique segment. Deeper analysis could define its characteristics more precisely.
Cluster 9: Another likely moderate-spending group. Comparison with other moderate-spending clusters will distinguish its profile.
Overall, the clusters are not as well defined as those from DBSCAN.
Conclusion and Recommendations¶
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
2. Refined insights:
- What are the most meaningful insights from the data relevant to the problem?
3. Proposal for the final solution design:
- What model do you propose to be adopted? Why is this the best solution to adopt?
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
The notebook explores five clustering techniques: K-Means, K-Medoids, Hierarchical clustering, DBSCAN, and Gaussian Mixture Models (GMM). While the project prompt does not explicitly state the primary success metric, visual inspection of cluster separation in PCA-reduced space, coupled with the silhouette score, suggests that DBSCAN performed best. DBSCAN achieved the highest silhouette score (around 0.575) and separated the data into two distinct clusters, as confirmed visually. GMM identified 10 clusters and captured more granular groupings visually, but its clusters were not as distinct or well separated as DBSCAN's. K-Means and K-Medoids were less successful at capturing the data's structure, both visually and in their clustering profiles. Performance could potentially be improved by exploring a wider range of hyperparameters for each algorithm, although the expanded parameter search for DBSCAN did not improve on its best score, and by feature engineering to address the visually apparent "stripes" in the PCA projection that the current methods do not fully capture.
2. Refined insights:
The most meaningful insights revolve around the customer segmentation revealed by the clustering. The two primary clusters identified by DBSCAN show distinct characteristics:
- High-value customers (DBSCAN Cluster 0): Higher spending, more frequent purchases, higher income, and more web purchases. Slightly higher propensity to complain.
- Lower-value customers (DBSCAN Cluster 1): Lower spending, fewer purchases, lower income, and fewer web purchases. Less likely to complain.
Similar trends were noted, but with less defined separation, in the GMM clusters. While the project briefly discusses other potential characteristics based on the GMM results, the two main clusters identified by DBSCAN provided the clearest insights. Further analysis of specific marketing campaigns on these two primary clusters would be necessary to validate their distinct response behavior. The visual observation of "stripes" in the data suggests additional, more granular segments may exist.
3. Proposal for the final solution design:
The proposed final solution is to use DBSCAN for customer segmentation due to its superior performance and clear cluster separation. While the GMM model showed some promise by capturing visually apparent finer-grained groupings, DBSCAN yielded the clearest high-level segmentation, offering more immediately actionable insights into two primary customer groups with distinct spending and purchasing patterns. Improvements can be made via further hyperparameter tuning and, more importantly, through extensive feature engineering to address the visible striping, allowing the modeling to capture more granular customer subgroups. This should further enhance the value derived from the customer segmentation process.
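As a sketch of how the proposed DBSCAN solution could be packaged for reuse, the scaling and clustering steps can be chained in a scikit-learn Pipeline. The data and the eps value below are illustrative stand-ins; on the project's data the best parameters found were eps≈3.45 and min_samples=2, and eps must be re-tuned whenever the feature set changes:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer feature matrix
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (20, 20)],
                  cluster_std=1.0, random_state=42)

# Scale, then cluster; eps here is tuned to this toy data, not the project's
segmenter = Pipeline([
    ('scale', StandardScaler()),
    ('cluster', DBSCAN(eps=0.5, min_samples=5)),
])
labels = segmenter.fit_predict(X)
print(sorted(set(labels)))  # → [0, 1], two segments on this toy data
```

Bundling the scaler with the clusterer ensures new data is always transformed with the same preprocessing that the segmentation was tuned on.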